On model averaging the coefficients of linear models

a shorter argument based on a specific example is here

“What model averaging does not mean is averaging parameter estimates, because parameters in different models have different meanings and should not be averaged, unless you are sure you are in a special case in which it is safe to do so.” – Richard McElreath, p. 196 of the textbook I wish I had learned from Statistical Rethinking

This is an infrequent but persistent criticism of model-averaged coefficients in the applied statistics literature on model averaging. I have heard from reviewers of a manuscript that it is infrequent only because it is probably common knowledge among statisticians – I’m not aware of actual data. In most sources, the criticism is like McElreath’s, one sentence without any further explanation. Three recent papers in the ecology literature have expanded explanations (Banner and Higgs 2017, Cade 2015, Dormann et al. 2018).

McElreath doesn’t give any hints what a “special case” looks like and I am not aware of any defense of model-averaged coefficients more generally, other than the extremely brief response to Draper’s equally brief comment in Hoeting et al. 1999.

Regardless, I think the “different meanings” criticism is wrong, and even obviously wrong, but the recognition of this has been masked by confused thinking about the parameters of a linear model. The confusion arises because of incorrectly equating the definition of the coefficients of a linear model and the definition of the parameters of a linear model.

A coefficient of a linear model is the difference in the modeled value given a one unit difference in the predictor,

\[\begin{equation} b_j.k = (\hat{Y}|X_j=x+1, X_k=x_k) - (\hat{Y}|X_j=x, X_k=x_k) \end{equation}\]

For generalized linear models, the modeled values are on the link scale. Because the modeled value of a linear model is a conditional mean, the coefficient is a difference in conditional means and therefore, is conditional on the additional covariates in the model (\(X_k\)).

Here is where the confusion arises – A linear model coefficient estimates two different parameters

Regression parameter, which is a function of probablistic conditioning \[\begin{equation} \theta_{j.k} = E[Y|X_j=x+1, X_k=x_k] - E[Y|X_j=x, X_k=x_k] \end{equation}\]
Effect parameter, which is a function of causal conditioning \[\begin{equation} \beta_j = E((Y_i | {X_j=x+1}) - (Y_i | {X_j=x})) \end {equation}\]

\[\begin{equation} \beta_j = E(Y | do(X_j=x+1)) - E(Y | do(X_j=x)) \end{equation}\]

Confusion arises when the definition of the linear model coefficient is thought to be the defintion of the parameter that is estimated. This is only true for the regression parameter.

The regression parameter \(\theta_{j.k}\) is conditional on the covariates \(X_k\). The “true” model is the model specified by the researcher – there is no “generating” model. The model, and the model parameters, change as the researcher adds or deletes independent variables. Shalizi refers to this as “probabalistic conditioning” (p. 505 of 01/30/2017 edition).

I give two, equivalent definitions of the effect parameter. The first is the counterfactual definition, developed by Rubin. It is the average of individual causal effects, where an individual causal effect is a potential outcome – what would happen if we could measure the response in individual \(i\) under two conditions (\(x\) and \(x + 1\)) with only \(X_j\) having changed. The second is an interventional definition based on the \(do\) operator, developed by Pearl. The \(do\) operator represents what would happen in a hypothetical intervention that modifies \(X_j\) but leaves all other variables unchanged.

The effect parameter \(\beta_j\) is the “generating” effect and, consequently, it is not conditional on other variables that also generate (or causally effect) the response. Shalizi refers to this as “causal conditioning.” The true model includes all of the variables that causally effect \(Y\). Causal graphs identify different sources of bias in the estimation of effect parameters and can be used to reduce the true model down to the set of variables necessary to estimate an effect without bias.

If our goal is mere description of the relationships among variables, the coefficients are estimating the regression parameter \(\theta_{j.k}\). Parameters from different models truly have different meanings because each specifically describes the relationship between \(Y\) and \(X_j\) conditional on a specific set of covariates. Averaging coefficients from different models would be averaging apples and oranges. Indeed, any sort of model selection would be comparing apples and oranges. Omitted variable bias and confounding are irrelevant or meaningless in the context of linear models as mere description, an important point that is missing from most statistics textbooks although see Gelman and Hill.

If our goal is to understand a system at a level that allows intervention (manipulation of variables) to generate a specific outcome, the \(b^[m]_j\) from each of the \(m\) nested submodels estimates the same effect parameter \(\beta_j\) and so have the same meaning. In any observational study, all of the models are wrong and each of the estimates \(b^[m]_j\) is biased by some unknown amount because of ommitted confounders. Model averaging the coefficients is a reasonable method for combining information from these models, all of which we know are wrong.

If our goal is prediction, the issue is moot, because in prediction we are generally not interested in the coefficients, unless we want to quantify something like variable importance. Parameters simply serve the purpose of mapping data to a prediction, they don’t have any meaning beyond this. If model-averaging the coefficients improves predictive performance, then model-average. And, for linear models, including generalized linear models, model-averaged predictions computed by averaging the predictions or averaging the coefficients are equivalent as long as the averaging is on the link scale.

This rebuttal to the claim that coefficients from linear models cannot be meaningfully averaged is not a defense of model averaging more generally. I am not advocating model averaging over alternative methods for causal modeling, I am simply arguing that coefficients from nested submodels can have the same meaning across models and, therefore, can be meaningfully averaged.

I am looking for a strong argument to my rebuttal. That means…

Arguing that the shrinkage property of model averaging is ad hoc and there are better methods (such as the family of penalized regression methods that include the lasso and ridge regression) that explicitly model the shrinkage parameter is not a argument against my rebuttal, only an argument for alternatives to model averaging.
Arguing that model selection and model averaging is mindless and careful selection of covariates is superior is not an argument against my rebuttal, only an argument against model selection and averaging more generally.
Arguing that model selection and model averaging is mindless and careful construction of a strucutural model based on prior knowledge is superior is not an argument against this rebuttal, only an argument against model selection and averaging more generally.
Arguing that model averaging was developed for model-averaged predictions or that most of the theory applies to model-averaged predictions is not an argument against this rebuttal, only an argument that we need better theoretical and empirical work on model-averaged coefficients.
Arguing that coefficients from non-linear models cannot be meaningfully averaged is not an argument against my rebuttal, only an argument on the limitation of model averaging coefficients (and again, contrary to Cade 2015 and Dormann et al. 2018, coefficients of nested submodels from generalized linear models, which are linear models, can be meaningfully averaged).

On model averaging the coefficients of linear models

R doodles. Some ecology. Some physiology. Much fake data.

How to make plots with factor levels below the x-axis (bench-biology style)

What is an interaction?

How to estimate synergism or antagonism

Type 3 ANOVA in R -- an easy way to publish wrong tables

Linear models with a covariate ("ANCOVA")

Normal Q-Q plots - what is the robust line and should we prefer it?

ANCOVA when the covariate is a mediator affected by treatment

Bootstrap confidence intervals when sample size is really small

What is the consequence of normalizing by each case in the control?

Melting a list of columns