Should the model-averaged prediction be computed on the link or response scale in a GLM?

[updated to include additional output from MuMIn, BMA, and BAS]

This post is a follow up to my inital post, which was written as as a way for me to pen my mental thoughts on the recent review of “Model averaging in ecology: a review of Bayesian, information‐theoretic and tactical approaches for predictive inference”. It was also written without contacting and discussing the issue with the authors. This post benefits from a series of e-mails with the lead author Carsten Dormann and the last author Florian Hartig.

The Dormann et al. paper focuses on model-averaged predictions, but has a short discussion on problems with model-averaged coefficients in the supplement. It is in the supplement, that the authors state that for generalized linear models (GLMs) “coefficient averaging is not equivalent to prediction averaging”. Brian Cade¹ also makes this argument.

In my previous post, I argued that this statement is wrong – predictions from a parameter-averaged model is mathematically identical to averaging predictions. It was hard for me to understand how this was not immediately obvious until my e-mail exchange with Carsten and Florian. In short, I assume that all averaging is done on the link scale and then the predictions are back-transformed to the response scale. Carsten and Florian (and Brian Cade?) argue that this is the “wrong” way to compute averaged predictions.

The summary of our differences is

Carsten and Florian argue that predictions should be computed on the link scale, back-transformed to the response scale, and then averaged on the response scale. I understand their reason to be that first, this is how everyone outside of ecology does it, and second, predictions have to be averaged on the response scale for non-linear models because there is no link scale, and, since, GLMs are non-linear models, they should be averaged on the response scale.
I argue that because a GLM model is a function of a linear predictor, any subsequent averaging should be on the linear scale, so that additive relationships remain additive and not multiplicative. Averaging on the link (linear) scale maintains consistency with the meaning of the fit parameters.

[update] Perhaps the reason for averaging on the response scale, at least in a Bayesian framework, is eq. 1 of Hoeting et al. 1999, which is

\[\begin{equation} \textrm{pr}(\Delta | D) = \sum_{k=1}^K \textrm{pr}(\Delta | M_k, D) \textrm{pr}(M_k | D) \end{equation}\]

where \(\Delta\) is a prediction. For a GLM, these \(\Delta\) are on the response scale because the models are \(\textrm{E}(Y) = \mu = g^{-1}(X\beta)\).

Some final thoughts using equation S2 from Dormann et al. 2018

\[\begin{equation} \frac{1}{m} \sum_{i=1}^m{g^{-1}(Xb_i)} \ne g^{-1}(X \frac{\sum_{i=1}^m{b_i}}{m}) \end{equation}\]

Dormann et al. advocate averaging using the LHS of eq. S2, I advocate using the RHS.
If using the RHS of S2 to average predictors, then averaging the predictors or computing the predictor from averaged coefficients are mathematically equivalent.
GLMs are unlike non-linear models in that non-linear models do not have link functions although some can be linearized. There is no linear model that is fit. Consquently, for non-linear models, averging the predictors of non-linear models on the “response scale” is consistent with the fit model. So in my opinion, this isn’t good justification for averaging GLMs on the response scale.
Because Carsten and Florian (and presumably the other co-authors of Dormann et al.) argue that predictions should be model averaged on the response scale, this raises questions about the default output for MuMIn and BAS which default to predictions on the link scale that were averaged on the link scale (both do produce predictions averaged on the response scale with type=‘response’). This raises one question and one concern. What “good” is a prediction averaged on the link scale – the default output of MuMIn and BAS? If we want a ma-prediction on the link scale, we can get this two ways: log transform the ma-predictions that were averaged on the response scale or simply average on the link scale. The second way does not produce predictions that are direct transformations with those on the response scale, so which is the correct way?
The concern raised above. The default prediction (if “type=” is not specified) for the MuMIn or BAS package is the prediction averaged on the link scale. Probably most researchers would want to interpret predictions on the response scale and the team that publishes this may achieve this by, perhaps naively, simply back-transforming the default predictions. This is the RHS of eq. S2. Someone comes along to reproduce the results, takes the same data, and specifies type=’response’. They get slightly different results, because they’ve used the LHS of eq. S1.
I think this argument from Russell Lenth, the author of the amazingly useful emmeans (formerly lsmeans) package, supports my argument for averaging on the link scale. Here, Lenth comments on why the package computes marginal means on the link and not response scale for GLMs:

The model is our best guide

This choice of timing is based on the idea that the model is right. In particular, the fact that the response is transformed suggests that the transformed scale is the best scale to be working with. In addition, the model specifies that the effects of source and percent are linear on the transformed scale; inasmuch as marginal averaging to obtain EMMs is a linear operation, that averaging is best done on the transformed scale. For those two good reasons, back-transforming to the response scale is delayed until the very end by default."

The difference in predicted values computed on the link vs. response scale is trivially small for the example below, and perhaps in most real examples, and if this is the case, our argument doesn’t really matter. There are much bigger sources of error in modeling than this.
Finally, in response to an exchange with Florian, I wanted to make sure that my understanding of MuMIn is correct (it seems to be), so below are five ways to compute a simple count (poisson) example using fake data in addition to the computation from MuMIn.
Like MuMIn, The BAS package defaults to averaging on the link scale without back-transformation to the response scale – see the code at the bottom. Unlike MuMIn, BAS outputs the predictions on the response scale averaged on the response scale – these are not equal to the back-transformation of the default predictions on the link scale. I couldn’t find this documented.
I also explore the BMA and AICcmodelavg packages.

How does MuMIn model average predictions?

library(ggplot2)
library(MuMIn)
library(BMA)

## Loading required package: survival

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.4-7)

library(BAS)
library(AICcmodavg)

## 
## Attaching package: 'AICcmodavg'

## The following objects are masked from 'package:MuMIn':
## 
##     AICc, DIC, importance

library(data.table)

A simple model of counts

  set.seed(1)
  n <- 100
  exp_beta_0 <- 175 # mean Y on response scale
  exp_beta_1 <- 0.99 # effect on response scale
  exp_beta_2 <- 0.99 # effect on response sacle
  
  # create correlated x1 and x2 due to common factor z
  z <- rnorm(n) # common factor to correlate X1 and X2
  r <- 0.6 # correlation between x1 and x2
  alpha <- sqrt(r) # "effect" of Z on X1 and X2
  x1 <- alpha*z + sqrt(1-r)*rnorm(n)
  x2 <- alpha*z + sqrt(1-r)*rnorm(n)

  # expected count in link space
  E_log_count <- log(exp_beta_0) + log(exp_beta_1)*x1 + log(exp_beta_2)*x2 # expected log count
  # observed counts
  count <- rpois(n=n, lambda=exp(E_log_count))

  # create data.table and fit 
  dt <- data.table(count=count, x1=x1, x2=x2)
  fit <- glm(count ~ x1 + x2, family=poisson(link = "log"), data=dt,na.action=na.fail )
  X <- model.matrix(fit)
  
  # all model regression using MuMIn
  fit.mm <- dredge(fit)

## Fixed term is "(Intercept)"

  model_set <- get.models(fit.mm, subset=TRUE) # all models
  fit.avg <- model.avg(model_set) # coeffcients are on link scale
  fit.avg

## 
## Call:
## model.avg(object = model_set)
## 
## Component models: 
## '2'      '12'     '1'      '(Null)'
## 
## Coefficients: 
##        (Intercept)          x2           x1
## full      5.168167 -0.02107276 -0.002326521
## subset    5.168167 -0.02288317 -0.007265122

  # (0) MuMIn predict
  yhat.default <- predict(fit.avg)
  yhat.MuMIn1 <- predict(fit.avg, type='response')
  yhat.MuMIn2 <- exp(predict(fit.avg, type='link'))
  yhat.MuMIn3 <- predict(fit.avg, backtransform=TRUE)
  yhat.MuMIn4 <- exp(predict(fit.avg, backtransform=FALSE))
  head(data.table(default=exp(yhat.default),
                  MuMIn1=yhat.MuMIn1, 
                  MuMIn2=yhat.MuMIn2,
                  MuMIn3=yhat.MuMIn3,
                  MuMIn4=yhat.MuMIn4
                  ))

##     default   MuMIn1   MuMIn2   MuMIn3   MuMIn4
## 1: 176.7927 176.7934 176.7927 176.7927 176.7927
## 2: 171.1034 171.1072 171.1034 171.1034 171.1034
## 3: 174.7764 174.7805 174.7764 174.7764 174.7764
## 4: 171.3024 171.3036 171.3024 171.3024 171.3024
## 5: 180.1184 180.1233 180.1184 180.1184 180.1184
## 6: 171.9407 171.9422 171.9407 171.9407 171.9407

  #is this averaged on link or response scale? And is it the coefficients or the prediction that is averaged?
  
  # (1) average coefficients on link scale. compute prediction on link scale. transform predictions to response scale
  b <- fit.avg$coefficients['full',][colnames(X)]
  yhat1 <- exp((X%*%b)[,1]) #
  b_ma_link <- b

  # (2) compute predictions for each model on link scale. Average on link scale. Backtransform to response scale
  yhat2a <- exp(predict(fit.avg, backtransform=FALSE))
  w <- fit.mm$weight
  yhat2b.each_model.link_scale <- sapply(model_set, predict)
  yhat2b.link_scale <- (yhat2b.each_model.link_scale%*%w)[,1]
  yhat2b <- exp(yhat2b.link_scale)
  
  # (3) compute predictions for each model on link scale. Backtransform to response scale. Average on response scale. This is method of Dormann et al.
  yhat3.each_model.response_scale <- exp(yhat2b.each_model.link_scale)
  yhat3 <- (yhat3.each_model.response_scale%*%w)[,1]
  
  # (4) backtransform coefficients to response scale. Average coefficients on response scale. Compute prediction on response scale.
  B <- exp(fit.mm[,colnames(X)])
  B[is.na(B)] <- 0.0
  b_ma <- t(B)%*%w
  yhat4 <- (X%*%b_ma)[,1] #

  # (5) average coefficients on link scale. backtransform to response scale. compute prediction on response scale
  b <- exp(fit.avg$coefficients['full',][colnames(X)])
  yhat5 <- (X%*%b)[,1] #

Results

The first few rows of the results matrix of the predictions using five different methods for their computation.

##       MuMIn MuMIn.rs    yhat1    yhat2    yhat3    yhat4    yhat5
## 1: 176.7927 176.7934 176.7927 176.7927 176.7934 175.1100 174.4955
## 2: 171.1034 171.1072 171.1034 171.1034 171.1072 176.7358 176.9463
## 3: 174.7764 174.7805 174.7764 174.7764 174.7805 175.5243 174.7209
## 4: 171.3024 171.3036 171.3024 171.3024 171.3036 176.9411 177.9302
## 5: 180.1184 180.1233 180.1184 180.1184 180.1233 174.4711 174.2690
## 6: 171.9407 171.9422 171.9407 171.9407 171.9422 176.5958 176.9982

Column Keys

MuMIn = MuMIn’s default prediction backtransformed to response scale

MuMIn.rs = MuMIn with type = ‘response’

yhat1 = Coefficients averaged on link scale. Predictions computed on link scale from averaged coefficients and then backtransformed to response scale. My preferred method. RHS of eq. S2.

yhat2 = Predictions computed on link scale for each model and then averaged on link scale and then backtransformed to response scale. Alternative to my preferred method.

yhat3 = Predictions computed on link scale for each model and then backtransformed to response scale and then averaged on response scale. This is the method of Dormann et al. 2018 and Cade 2015

yhat4 = Coefficients backtransformed to response scale and then averaged on response scale. Predictions computed on response scale from averaged coefficients.

yhat5 = Coefficients averaged on link scale and then backtransformed to response scale, which are used to compute averaged predictions.

How does bic.glm model average predictions?

This simulation uses the fake data generated above.

fit.bma <- bic.glm(count ~ x1 + x2, glm.family=poisson(link = "log"), data=dt)
# (0) BMA predict
yhat.default <- predict(fit.bma, newdata=dt) # these are on the response scale

# (1) use model averaged B to get prediction on link scale. Backtransform to response scale (eq. s2 RHS of appendix)
b = fit.bma$postmean
Xb <- X%*%b
yhat1 <- exp(Xb) # averaged predictions backtransformed to response

res <- data.table(default=yhat.default, yhat1=yhat1)

How does the BAS package model-average predictions?

This simulation uses the fake data generated above.

  # fit using BMA
  packageVersion("BAS")

## [1] '1.5.3'

  # fit <- glm(count ~ x1 + x2, family=poisson(link = "log"), data=dt,na.action=na.fail )
  fit.bma <- bas.glm(count ~ x1 + x2, family=poisson(link = "log"), data=dt)
 
  res.bma.rs <- predict(fit.bma, type='response')
  res.bma0.rs <- res.bma.rs$fit
  res.bma1.rs.ls <- res.bma.rs$Ybma[,1]
  res.bma1.rs.exp.ls <- exp(res.bma1.rs.ls)
  
  res.bma.ls <- predict(fit.bma)
  res.bma0.ls <- res.bma.ls$fit
  res.bma1.ls.ls <- res.bma.ls$Ybma[,1]
  res.bma1.ls.exp.ls <- exp(res.bma1.ls.ls)
  
  
  bas.res1 <- data.table(
    default.fit=res.bma0.ls,
    default.Ybma=res.bma1.ls.ls,
    default.exp.Ybma=res.bma1.ls.exp.ls,
    response.fit=res.bma0.rs,
    response.Ybma=res.bma1.rs.ls,
    response.exp.Ybma=res.bma1.rs.exp.ls
  )
  
  res.bma <- predict(fit.bma)
  yhat.bma.response <- res.bma$fit
  yhat.bma.link <- res.bma$Ybma[,1]
  YHAT <- t(res.bma$Ypred)
  w <- res.bma$postprobs
  # averaged on response scale
  yhat1.bma.response <- (exp(YHAT)%*%w)[,1]
  # averaged on link scale and then backtransformed
  yhat2.bma.link <- (YHAT%*%w)[,1]
  yhat2.bma.response <- exp(yhat2.bma.link)

  bas.res2 <- data.table(yhat0=yhat.bma.response,
             yhat1=yhat1.bma.response,
             yhat2=yhat2.bma.response)

Result 1

predict.basglm seems to have the same output regardless of the type=‘response’ specification. The prediction on the link scale is in “Ybma” and the prediction on the response scale is in “fit”. Note that \(exp(Ybma) \ne fit\). s

head(bas.res1)

##    default.fit default.Ybma default.exp.Ybma response.fit response.Ybma
## 1:    176.3709     5.172571         176.3677     176.3709      5.172571
## 2:    173.3735     5.155378         173.3613     173.3735      5.155378
## 3:    175.6617     5.168528         175.6560     175.6617      5.168528
## 4:    172.8983     5.152634         172.8863     172.8983      5.152634
## 5:    177.4689     5.178703         177.4525     177.4689      5.178703
## 6:    173.5868     5.156636         173.5795     173.5868      5.156636
##    response.exp.Ybma
## 1:          176.3677
## 2:          173.3613
## 3:          175.6560
## 4:          172.8863
## 5:          177.4525
## 6:          173.5795

Result 2

\(fit\) is indeed the predictions averaged on the response scale and not \(exp(Ybma)\)

head(bas.res2)

##       yhat0    yhat1    yhat2
## 1: 176.3709 176.3709 176.3677
## 2: 173.3735 173.3735 173.3613
## 3: 175.6617 175.6617 175.6560
## 4: 172.8983 172.8983 172.8863
## 5: 177.4689 177.4689 177.4525
## 6: 173.5868 173.5868 173.5795

Key:

yhat0 - “fit”, which is the prediction on the response scale

yhat1 - my manual computation of the average on the response scale

yhat2 - my manual computation of the average on the link scale and then back-transformed to the response scale.

How does the AICcmodavg package model-average predictions?

data(min.trap)
##assign "UPLAND" as the reference level as in Mazerolle (2006)
min.trap$Type <- relevel(min.trap$Type, ref = "UPLAND")
##set up candidate models
Cand.mod <- list()
Cand.mod[[1]] <- glm(Num_anura ~ Type + log.Perimeter + Num_ranatra,
                     family = poisson, offset = log(Effort),
                     data = min.trap)
Cand.mod[[2]] <- glm(Num_anura ~ Type + log.Perimeter, family = poisson,
                     offset = log(Effort), data = min.trap)
Cand.mod[[3]] <- glm(Num_anura ~ Type + Num_ranatra, family = poisson,
                     offset = log(Effort), data = min.trap)
Cand.mod[[4]] <- glm(Num_anura ~ Type, family = poisson,
                     offset = log(Effort), data = min.trap)
Cand.mod[[5]] <- glm(Num_anura ~ log.Perimeter + Num_ranatra,
                     family = poisson, offset = log(Effort),
                     data = min.trap)

Cand.mod[[6]] <- glm(Num_anura ~ log.Perimeter, family = poisson,
                     offset = log(Effort), data = min.trap)
Cand.mod[[7]] <- glm(Num_anura ~ Num_ranatra, family = poisson,
                     offset = log(Effort), data = min.trap)
Cand.mod[[8]] <- glm(Num_anura ~ 1, family = poisson,
                     offset = log(Effort), data = min.trap)
##check c-hat for global model
c_hat(Cand.mod[[1]], method = "pearson") #uses Pearson's chi-square/df

## 'c-hat' 1.04 (method: pearson estimator)

##note the very low overdispersion: in this case, the analysis could be
##conducted without correcting for c-hat as its value is reasonably close
##to 1
##assign names to each model
Modnames <- c("type + logperim + invertpred", "type + logperim",
              "type + invertpred", "type", "logperim + invertpred",
              "logperim", "invertpred", "intercept only")
##model selection table based on AICc
aictab(cand.set = Cand.mod, modnames = Modnames)

## 
## Model selection based on AICc:
## 
##                              K  AICc Delta_AICc AICcWt Cum.Wt     LL
## type + invertpred            3 54.03       0.00   0.60   0.60 -23.42
## type + logperim + invertpred 4 56.57       2.54   0.17   0.77 -23.23
## logperim + invertpred        3 57.91       3.88   0.09   0.86 -25.35
## invertpred                   2 58.63       4.60   0.06   0.92 -27.03
## type + logperim              3 59.38       5.35   0.04   0.96 -26.09
## type                         2 59.74       5.71   0.03   1.00 -27.58
## intercept only               1 65.47      11.44   0.00   1.00 -31.65
## logperim                     2 67.27      13.24   0.00   1.00 -31.35

dat.pred <- data.frame(Type = factor(c("BOG", "UPLAND")),
                       log.Perimeter = mean(min.trap$log.Perimeter),
                       Num_ranatra = mean(min.trap$Num_ranatra),
                       Effort = mean(min.trap$Effort))
yhat.response <- modavgPred(cand.set = Cand.mod, modnames = Modnames,
           newdata = min.trap, type = "response")
yhat.link <- modavgPred(cand.set = Cand.mod, modnames = Modnames,
           newdata = min.trap, type = "link")
yhat.default <- modavgPred(cand.set = Cand.mod, modnames = Modnames,
           newdata = min.trap)

res <- data.table(response=yhat.response$mod.avg.pred, 
                  link=exp(yhat.link$mod.avg.pred),
                  default=yhat.default$mod.avg.pred)

Cade, B.S. (2015). Model averaging and muddled multimodel inferences. Ecology 96, 2370–2382.↩

Should the model-averaged prediction be computed on the link or response scale in a GLM?

How does MuMIn model average predictions?

Results

How does bic.glm model average predictions?

How does the BAS package model-average predictions?

Result 1

Result 2

How does the AICcmodavg package model-average predictions?

R doodles. Some ecology. Some physiology. Much fake data.

How to make plots with factor levels below the x-axis (bench-biology style)

What is an interaction?

How to estimate synergism or antagonism

Type 3 ANOVA in R -- an easy way to publish wrong tables

Linear models with a covariate ("ANCOVA")

Normal Q-Q plots - what is the robust line and should we prefer it?

ANCOVA when the covariate is a mediator affected by treatment

Bootstrap confidence intervals when sample size is really small

What is the consequence of normalizing by each case in the control?

Melting a list of columns