Using Wright's rules and a DAG to compute the bias of an effect when we measure proxies for X and Y
This is a skeletal post working up an answer to a Twitter question using Wright’s rules for path models. Using this figure,
the question is: "I want to know about A->Y but I measure A* and Y*. So in figure A, is the bias the backdoor path from A* to Y* through A and Y?"
Short answer: the bias is the path as stated divided by the path from A to Y.
Medium answer: We want to estimate \(\beta\), the true effect of \(A\) on \(Y\). We have measured the proxies, \(A^\star\) and \(Y^\star\). The true effect of \(A\) on \(A^\star\) is \(\alpha_1\) and the true effect of \(Y\) on \(Y^\star\) is \(\alpha_2\).
So what we want is \(\beta\), but what we estimate is \(\alpha_1 \beta \alpha_2\) (the path from \(A^\star\) to \(Y^\star\)), so the bias is the multiplicative factor \(\alpha_1 \alpha_2\).
But this is strictly true only for standardized effects. If the variables are not variance-standardized (and why should they be?), the bias is a bit more complicated.
TL;DR answer: In terms of \(\beta\), the estimated effect is
\[\begin{equation} \alpha_1 \alpha_2 \frac{\sigma_A^2}{\sigma_{A^\star}^2} \beta \end{equation}\]
so the bias is
\[\begin{equation} k = \alpha_1 \alpha_2 \frac{\sigma_A^2}{\sigma_{A^\star}^2} \end{equation}\]
The derivation is scratched out here:
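For anyone who cannot see the scratch work, here is a short reconstruction of where \(k\) comes from. It is a sketch that assumes linear effects and mutually independent error terms; the error symbols \(\varepsilon\) and the correlation symbols \(\rho\) are notation added here, not taken from the figure.
\[\begin{equation} A^\star = \alpha_1 A + \varepsilon_{A^\star}, \quad Y = \beta A + \varepsilon_Y, \quad Y^\star = \alpha_2 Y + \varepsilon_{Y^\star} \end{equation}\]
Because the errors are independent of \(A\) and of each other, only the shared \(A\) terms contribute to the covariance of the two proxies, and the regression slope of \(Y^\star\) on \(A^\star\) is that covariance divided by the variance of \(A^\star\):
\[\begin{equation} \frac{\mathrm{cov}(A^\star, Y^\star)}{\sigma_{A^\star}^2} = \frac{\alpha_1 \alpha_2 \beta \sigma_A^2}{\sigma_{A^\star}^2} = k \beta \end{equation}\]
Writing the unstandardized effects in terms of correlations, \(\alpha_1 = \rho_{\alpha_1} \sigma_{A^\star}/\sigma_A\) and \(\alpha_2 = \rho_{\alpha_2} \sigma_{Y^\star}/\sigma_Y\), gives the equivalent form
\[\begin{equation} k = \rho_{\alpha_1} \rho_{\alpha_2} \frac{\sigma_A}{\sigma_Y} \frac{\sigma_{Y^\star}}{\sigma_{A^\star}} \end{equation}\]
With all variances equal to one, both forms reduce to \(k = \alpha_1 \alpha_2\).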
If the data are standardized (all variables have unit variance)
This is easy and just uses Wright’s rules: multiply the coefficients along a path (and add across paths, though here there is only one path from \(A^\star\) to \(Y^\star\)).
n <- 10^6
beta <- 0.6 # true effect
alpha_1 <- 0.9 # standardized effect of A on A* -- this is the correlation of A with proxy
alpha_2 <- 0.8 # standardized effect of Y on Y* -- this is the correlation of Y with proxy
A <- rnorm(n)
Y <- beta*A + sqrt(1 - beta^2)*rnorm(n)
astar <- alpha_1*A + sqrt(1 - alpha_1^2)*rnorm(n) # proxy for A
ystar <- alpha_2*Y + sqrt(1 - alpha_2^2)*rnorm(n) # proxy for Y
\(\beta\) is the true effect and the expected estimated effect is \(\alpha_1 \beta \alpha_2\) (using Wright’s rules), so \(\alpha_1 \alpha_2\) is the bias. Note that this bias multiplies the true effect; it isn’t added to it as in omitted variable bias (confounding). We can check this with the fake data.
alpha_1*beta*alpha_2 # expected measured effect
## [1] 0.432
coef(lm(ystar ~ astar)) # measured effect
## (Intercept) astar
## 0.0007277386 0.4313778378
Check some other measures
var(A) # should be 1
## [1] 1.002161
var(Y) # should be 1
## [1] 1.001362
var(astar) # should be 1
## [1] 1.00204
var(ystar) # should be 1
## [1] 0.998893
cor(ystar, astar) # should be equal to expected measured effect
## [1] 0.4320569
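To get a feel for how the multiplicative bias in the standardized case depends on proxy quality, the sketch below tabulates the expected slope \(\alpha_1 \beta \alpha_2\) over a small grid. The grid values in `alphas` are arbitrary illustrations, not values from the question, and `beta` is copied from the simulation above.

```r
# Sketch: expected standardized slope alpha_1 * beta * alpha_2 over a grid
# of proxy correlations. The grid values are hypothetical.
beta <- 0.6                       # standardized true effect, as in the simulation
alphas <- c(0.5, 0.7, 0.9, 1.0)   # hypothetical correlations of a variable with its proxy

# k_grid[i, j] is the bias factor alphas[i] * alphas[j]
k_grid <- outer(alphas, alphas)
dimnames(k_grid) <- list(alpha_1 = as.character(alphas),
                         alpha_2 = as.character(alphas))

# Expected estimated effect for each pair of proxies
expected_slope <- k_grid * beta
round(expected_slope, 3)
```

With proxy correlations between 0 and 1 the factor can only shrink the estimate toward zero, and the true effect is recovered only when both proxies are perfect (the bottom-right cell).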
If the data are not standardized
n <- 10^5
rho_alpha_1 <- 0.9 # correlation of A and A*
rho_alpha_2 <- 0.8 # correlation of Y and Y*
rho_b <- 0.6 # standardized true effect of A on Y
sigma_A <- 2 # standard deviation of A
sigma_Y <- 10 # standard deviation of Y
sigma_astar <- 2.2 # standard deviation of A*
sigma_ystar <- 20 # standard deviation of Y*
alpha_1 <- rho_alpha_1*sigma_astar/sigma_A # effect of A on astar
alpha_2 <- rho_alpha_2*sigma_ystar/sigma_Y # effect of Y on ystar
beta <- rho_b*sigma_Y/sigma_A # effect of A on Y (the thing we want)
A <- rnorm(n, sd=sigma_A)
R2_Y <- (beta*sigma_A)^2/sigma_Y^2 # R^2 for E(Y|A)
Y <- beta*A + sqrt(1-R2_Y)*rnorm(n, sd=sigma_Y)
R2_astar <- (alpha_1*sigma_A)^2/sigma_astar^2 # R^2 for E(astar|A)
astar <- alpha_1*A + sqrt(1-R2_astar)*rnorm(n, sd=sigma_astar)
R2_ystar <- (alpha_2*sigma_Y)^2/sigma_ystar^2 # R^2 for E(ystar|Y)
ystar <- alpha_2*Y + sqrt(1-R2_ystar)*rnorm(n, sd=sigma_ystar)
Now let’s check our math in the figure above. Here is the estimated effect
coef(lm(ystar ~ astar))
## (Intercept) astar
## 0.004357835 3.950256435
And the expected estimated effect using just the standardized coefficients
rho_alpha_1*rho_alpha_2*rho_b*sigma_ystar/sigma_astar
## [1] 3.927273
And the expected estimated effect using the equation \(k \beta\), where k is the bias (this is in the top image of the derivation)
k <- rho_alpha_1*sigma_A/sigma_Y*rho_alpha_2*sigma_ystar/sigma_astar
k*beta
## [1] 3.927273
And finally, the expected estimated effect using the bias as a function of the unstandardized variables (this is in the bottom, part 2, image of the derivation)
k <- alpha_1*alpha_2*sigma_A^2/sigma_astar^2
k*beta
## [1] 3.927273
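The two expressions for \(k\) used above should be algebraically identical. As a quick check that needs no simulated data, the sketch below wraps each in a small function and compares them at the parameter values from this section. The function names `k_from_rho` and `k_from_alpha` are made up for this illustration.

```r
# Parameters copied from the unstandardized simulation above
rho_alpha_1 <- 0.9; rho_alpha_2 <- 0.8; rho_b <- 0.6
sigma_A <- 2; sigma_Y <- 10; sigma_astar <- 2.2; sigma_ystar <- 20

alpha_1 <- rho_alpha_1 * sigma_astar / sigma_A  # effect of A on astar
alpha_2 <- rho_alpha_2 * sigma_ystar / sigma_Y  # effect of Y on ystar
beta    <- rho_b * sigma_Y / sigma_A            # effect of A on Y

# Hypothetical helper: bias factor from correlations and standard deviations
k_from_rho <- function(rho_1, rho_2, s_A, s_Y, s_astar, s_ystar) {
  rho_1 * rho_2 * (s_A / s_Y) * (s_ystar / s_astar)
}

# Hypothetical helper: bias factor from unstandardized effects and variances
k_from_alpha <- function(a_1, a_2, s_A, s_astar) {
  a_1 * a_2 * s_A^2 / s_astar^2
}

k1 <- k_from_rho(rho_alpha_1, rho_alpha_2, sigma_A, sigma_Y, sigma_astar, sigma_ystar)
k2 <- k_from_alpha(alpha_1, alpha_2, sigma_A, sigma_astar)

all.equal(k1, k2)  # the two forms agree up to floating-point error
k2 * beta          # expected estimated effect
```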
And the true effect?
beta
## [1] 3
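Since the bias is a multiplicative factor, dividing the estimated slope by \(k\) should land close to the true effect. The sketch below does this with the slope reported by `lm(ystar ~ astar)` above. It is only a consistency check on the simulation: it assumes \(\alpha_1\), \(\alpha_2\), \(\sigma_A\) and \(\sigma_{A^\star}\) are known, which they usually are not with real data, and the name `beta_rescaled` is introduced here for illustration.

```r
# Slope of lm(ystar ~ astar) as reported in the output above
slope_hat <- 3.950256435

# Simulation parameters from this section
alpha_1 <- 0.99     # effect of A on astar
alpha_2 <- 1.6      # effect of Y on ystar
sigma_A <- 2        # standard deviation of A
sigma_astar <- 2.2  # standard deviation of astar

# Bias factor and rescaled estimate
k <- alpha_1 * alpha_2 * sigma_A^2 / sigma_astar^2
beta_rescaled <- slope_hat / k
beta_rescaled  # close to the true effect of 3, up to sampling error
```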
Some other checks
coef(lm(ystar ~ Y))
## (Intercept) Y
## -0.03612649 1.60271514
alpha_2
## [1] 1.6
coef(lm(ystar ~ A))
## (Intercept) A
## -0.00254846 4.80956599
alpha_2*beta
## [1] 4.8
coef(lm(astar ~ A))
## (Intercept) A
## -0.002796505 0.991465329
alpha_1
## [1] 0.99
sd(A)
## [1] 2.014547
sd(Y)
## [1] 10.04232
sd(astar)
## [1] 2.215202
sd(ystar)
## [1] 20.09823
cor(A, astar)
## [1] 0.9016573
cor(Y, ystar)
## [1] 0.8008157