What is the bias in the estimation of an effect given an omitted interaction term?
Some background (due to Sewall Wright’s method of path analysis)
Given a generating model:
where ; that is, it is an interaction variable.
The total effect of on is .
If (the interaction) is missing, its component on the total efffect is added to the coefficient of . This added component is the bias due to the omitted interaction.
Definitions
This computation assumes an interaction effect is defined as the “leftover” residual after adding up the intercept plus additive effects – that is, the coefficient estimated using dummy coding (which is R’s default coding). See Is the power to test an interaction effect less than that for a main effect? for other definitions/parameterizations of an interaction.
Libraries
I use data.table
library(data.table)
Factorial design with ommitted interaction
Let’s simulate this with a factorial design. In a balanced fatorial design, , so the bias is
n <- 10^6 # a high n is used so that the estimate is very close to the true expected value
beta_0 <- 0
beta_1 <- 1
beta_2 <- 0.5
beta_3 <- 0.5
beta <- c(beta_0, beta_1, beta_2, beta_3)
sigma <- 1
fd <- data.table(
x1=factor(rep(c("CnA", "TrA"), each=2*n)),
x2=factor(rep(rep(c("CnB", "TrB"), each=n), 2))
)
X <- model.matrix(~x1*x2, data=fd)
cov_X <- cov(X)
fd[, y:=(X%*%beta)[,1] + rnorm(n*4)]
The bias is
b <- cov_X[4,2]/cov_X[2,2] # this will be 0.5 in balanced design
b*beta_3
# [1] 0.25
The expected coefficient given the known true effect plus the bias is
beta_1 + b*beta_3
# [1] 1.25
The coefficient of in the linear model with the missing interaction estimates this biased effect.
coef(lm(y ~ x1 + x2, data=fd))[2]
# x1TrA
# 1.248929
Two continuous with interaction
n <- 10^6
# generate two correlated x
rho <- 0.6 # true correlation between the two x variables
z <- rnorm(n)
sigma_x <- sqrt(1-rho)
x1 <- sqrt(rho)*z + rnorm(n, sd=sigma_x) # expected variance is 1
x2 <- sqrt(rho)*z + rnorm(n, sd=sigma_x) # expected variance is 1
x3 <- x1*x2
beta_0 <- 0
beta_1 <- 1
beta_2 <- 0.5
beta_3 <- 0.5
beta <- c(beta_0, beta_1, beta_2, beta_3)
sigma <- 1
fd <- data.table(
x1=x1,
x2=x2
)
X <- model.matrix(~x1*x2, data=fd)
cov_X <- cov(X)
fd[, y:=(X%*%beta)[,1] + rnorm(n)]
The bias is
b <- cov_X[4,2]/cov_X[2,2] # this will be 0.5 in balanced design
b*beta_3
# [1] -0.001261114
huh. Is this generally the case that the regression of the interaction on the main variable is near zero? If it is, then an omitted interaction would have little bias, as here.
The expected coefficient given the known true effect plus the bias is
beta_1 + b*beta_3
# [1] 0.9987389
The coefficient of in the linear model with the missing interaction estimates this biased effect.
coef(lm(y ~ x1 + x2, data=fd))[2]
# x1
# 0.9966902