This is fake data that simulates an experiment to measure effect of treatment on fat weight in mice. The treatment is “diet” with two levels: “control” (blue dots) and “treated” (gold dots). Diet has a large effect on total body weight. The simulated data are in the plot above - these look very much like the real data.
The question is, what are problems with using an “ancova” linear model to estimate the direct effect of treatment on fat weight?
This is a skeletal post to work up an answer to a twitter question using Wright’s rules of path models. Using this figure
from Panel A of a figure from Hernan and Cole. The scribbled red path coefficients are added
the question is I want to know about A->Y but I measure A* and Y*. So in figure A, is the bias the backdoor path from A* to Y* through A and Y?
Some background (due to Sewall Wright’s method of path analysis) Given a generating model:
\[\begin{equation} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \end{equation}\] where \(x_3 = x_1 x_2\); that is, it is an interaction variable.
The total effect of \(x_1\) on \(y\) is \(\beta_1 + \frac{\mathrm{COV}(x_1, x_2)}{\mathrm{VAR}(x_1)} \beta_2 + \frac{\mathrm{COV}(x_1, x_3)}{\mathrm{VAR}(x_1)} \beta_3\).
If \(x_3\) (the interaction) is missing, its component on the total efffect is added to the coefficient of \(x_1\).
This is a skeleton post
Standardized variables (Wright’s rules) n <- 10^5 # z is the common cause of g1 and g2 z <- rnorm(n) # effects of z on g1 and g2 b1 <- 0.7 b2 <- 0.7 r12 <- b1*b2 g1 <- b1*z + sqrt(1-b1^2)*rnorm(n) g2 <- b2*z + sqrt(1-b2^2)*rnorm(n) var(g1) # E(VAR(g1)) = 1 ## [1] 1.001849 var(g2) # E(VAR(g2)) = 1 ## [1] 1.006102 cor(g1, g2) # E(COR(g1,g2)) = b1*b2 ## [1] 0.