Copyright © Jeffrey A. Walker
Fig 1C of the Replication Study: Melanoma exosomes educate bone marrow progenitor cells toward a pro-metastatic phenotype through MET uses an odd (to me) three-stage normalization procedure for the quantified western blots. The authors compared blot values between a treatment (shMet cells) and a control (shScr cells) using GAPDH to normalize the values. The three stages of the normalization (detailed in the code below) are:
1. each value is divided by the GAPDH value for its set (norm1),
2. the GAPDH-normalized values are divided by the mean of the GAPDH-normalized shScr (control) values on the same blot (norm2), and
3. the shScr values are set equal to one (norm3).
As shown in the simulation below and summarized in the section Summary of simulation results for n=4, stage 1 can introduce inflated conditional type I error due to regression to the mean, while the stage 2 and 3 renormalizations introduce inflated marginal (or unconditional) type I error.
Normalizing a value using a reference, such as GAPDH, is meant to increase the precision of a treatment effect by removing the noise in band intensity due to non-biological sources of variation. The reference value (intensity) is the proxy for this variation, and the treatment or control values are expected to go up and down (that is, have a positive correlation) with this reference. Normalizing by a reference value assumes that the correlation between reference values and either control or treatment values is 1.0 – that is, that they are affected by the non-biological sources of variation in precisely the same way. At any correlation less than 1.0, there will be some consequence of regression to the mean due to the vicissitudes of sampling. This consequence is largest when the true correlation between reference values and control/treatment values is zero.
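To see why this correlation matters, here is a standard first-order (delta-method) approximation for the variance of a normalized value \(Y/X\), where \(Y\) is the focal band intensity, \(X\) is the reference, \(\mu\) and \(\sigma\) are means and standard deviations, and \(\rho\) is the correlation between \(X\) and \(Y\) (this is my addition, not an equation from the focal study):

\[
\mathrm{Var}\left(\frac{Y}{X}\right) \approx \frac{\mu_Y^2}{\mu_X^2}\left(\frac{\sigma_Y^2}{\mu_Y^2} + \frac{\sigma_X^2}{\mu_X^2} - 2\rho\,\frac{\sigma_X \sigma_Y}{\mu_X \mu_Y}\right)
\]

When \(\rho\) is near 1 (and the coefficients of variation of \(X\) and \(Y\) are similar), the covariance term cancels most of the noise, which is the point of normalizing. When \(\rho\) is near zero, dividing by the reference adds the reference's noise to the focal value instead of removing it.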
The consequences of regression to the mean on type I error can be shown by plotting the probability of type I error against the observed difference in the reference value between treatment and control, which I’ll refer to as \(\Delta Gapdh\) since GAPDH is the reference in the focal study. An increase in the probability of type I error as the magnitude of \(\Delta Gapdh\) increases is the result of regression to the mean – an experiment with a larger \(\Delta Gapdh\) is more likely to result in a larger observed treatment effect, when no true treatment effect exists, and therefore more likely to result in small p-values and inflated type I error.
The type I error as a function of the magnitude of \(\Delta Gapdh\) is the conditional type I error, because it is conditional on \(\Delta Gapdh\). I refer to the type I error taken over all values of \(\Delta Gapdh\) as the marginal type I error. The marginal type I error is the type I error that we usually talk about (because we usually don't think of it as being conditioned on some covariate).
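To make this concrete before turning to the focal data, here is a minimal sketch (my own toy example with made-up normal distributions, not the study's data or my simulation below) in which there is no true treatment effect and the response is completely independent of the reference. Even so, the GAPDH-normalized effect is associated with \(\Delta Gapdh\): experiments that happen to draw a large \(\Delta Gapdh\) tend to show a large spurious normalized effect.
set.seed(42)
n_sim <- 5000 # number of simulated experiments
n <- 4 # replicates per treatment level
delta_gapdh <- numeric(n_sim)
effect_norm <- numeric(n_sim)
for(i in 1:n_sim){
  ref_cn <- rnorm(n, mean=100, sd=10) # reference (GAPDH-like), control lanes
  ref_tr <- rnorm(n, mean=100, sd=10) # reference, treatment lanes
  y_cn <- rnorm(n, mean=50, sd=5) # response, control; independent of the reference
  y_tr <- rnorm(n, mean=50, sd=5) # response, treatment; no true treatment effect
  delta_gapdh[i] <- mean(ref_tr) - mean(ref_cn)
  effect_norm[i] <- mean(y_tr/ref_tr) - mean(y_cn/ref_cn)
}
# the correlation is negative: a large positive delta_gapdh pushes the normalized
# treatment values down relative to control (and vice versa), so experiments with
# large |delta_gapdh| tend to show large spurious normalized effects
cor(delta_gapdh, effect_norm)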
library(here)
library(janitor)
library(data.table)
library(lmerTest)
library(emmeans)
library(ggpubr)
library(cowplot)
here <- here::here
data_path <- "content/data"
output_path <- "content/output"
simulate_it <- FALSE # FALSE indicates the simulations were already run and written to disk
folder <- "Data from Generation and characterization of shMet B16-F10 cells and exosomes"
filename <- "Study_42_Figure_1_WB_quant_Data.csv"
file_path <- here(data_path, folder, filename)
exp1 <- fread(file_path)
exp1[, Condition := factor(Condition, c("shScr", "shMet"))]
#View(exp1)
The Met and pMet values in Fig 1C are normalized using the three-step procedure (each step is its own kind of normalization).
# get GAPDH ref for each row to rescale ("normalize") by GAPDH
gapdh_ref.dt <- exp1[Antibody=="Gapdh", .(gapdh_ref=mean(Value)), by=Set]
exp1.v1 <- merge(exp1, gapdh_ref.dt, by="Set")
exp1.v1[, norm1:=Value/gapdh_ref]
# get mean shScr for each Antibody:Type:Blot to rescale by mean shScr
shScr_ref.dt <- exp1.v1[Condition=="shScr", .(shScr_ref=mean(norm1)), by=.(Antibody, Type, Blot)]
exp1.v1 <- merge(exp1.v1, shScr_ref.dt, by=c("Antibody", "Type", "Blot"))
exp1.v1[, norm2:=norm1/shScr_ref]
exp1.v1[, norm3:=ifelse(Condition=="shScr", 1, norm2)]
#View(exp1.v1)
gg1 <- ggbarplot(data=exp1.v1[Antibody=="Met" & Type=="Cells"],
x="Condition",
y="norm1",
add=c("mean_se")) +
ylab("Met") +
NULL
gg2 <- ggbarplot(data=exp1.v1[Antibody=="pMet" & Type=="Cells",],
x="Condition",
y="norm1",
add=c("mean_se")) +
ylab("pMet") +
NULL
gg3 <- ggbarplot(data=exp1.v1[Antibody=="Met" & Type=="Cells"],
x="Condition",
y="norm3",
add=c("mean_se")) +
ylab("Met") +
NULL
gg4 <- ggbarplot(data=exp1.v1[Antibody=="pMet" & Type=="Cells",],
x="Condition",
y="norm3",
add=c("mean_se")) +
ylab("pMet") +
NULL
plot_grid(gg1, gg2, gg3, gg4, nrow=2)
The two bottom plots reproduce Fig 1C from the paper, which uses norm3. The two top plots are scaled by GAPDH but not shScr (norm1).
gg1 <- ggstripchart(data=exp1.v1[Antibody=="Met" & Type=="Cells"],
x="Condition",
y="Value",
add=c("mean_se")) +
ylab("Met") +
theme_minimal() +
NULL
gg2 <- ggstripchart(data=exp1.v1[Antibody=="pMet" & Type=="Cells",],
x="Condition",
y="Value",
add=c("mean_se")) +
ylab("pMet") +
theme_minimal() +
NULL
gg3 <- ggplot(data=exp1.v1[Antibody=="Met" & Type=="Cells"],
aes(x=gapdh_ref/10^3, y=Value, color=Condition)) +
geom_point() +
ylab("Met") +
xlab(expression(paste(Gapdh, " (X", 10^{-3}, ")"))) +
theme_minimal() +
NULL
gg4 <- ggplot(data=exp1.v1[Antibody=="pMet" & Type=="Cells"],
aes(x=gapdh_ref/10^3, y=Value, color=Condition)) +
geom_point() +
ylab("pMet") +
xlab(expression(paste(Gapdh, " (X", 10^{-3}, ")"))) +
theme_minimal() +
NULL
plot_grid(gg1, gg2, gg3, gg4, nrow=2)
We might infer from the top plots that the shMet condition decreases Met and pMet, but the p-values from a t-test for these are 0.098 and 0.208 (see below). (Note: here and throughout, I use simple linear models and t-tests to compute p-values even though I would probably fit generalized linear models were I to analyze these data. For comparison, the Wilcoxon p-values are 0.1 and 0.34.) The bottom plots suggest trivial correlations between GAPDH and either shScr or shMet, although the sample size for this is extremely small (see below for computations of various attempts to estimate this correlation).
Here I compare p-values of t-tests of the three different normalizations with a t-test of the raw (non-normalized) values and with a linear model ("lm") with \(Gapdh\) as a covariate, which is the preferred method of adjusting for nuisance co-variation.
prob <- numeric(5) # p-values for all five ways of analyzing data
# linear model with ref as covariate
m1 <- lm(Value ~ gapdh_ref + Condition, data=exp1.v1[Antibody=="Met" & Type=="Cells"])
# linear model with no accounting for ref
m2 <- lm(Value ~ Condition, data=exp1.v1[Antibody=="Met" & Type=="Cells"])
# linear model using Gapdh normalized values
m3 <- lm(norm1 ~ Condition, data=exp1.v1[Antibody=="Met" & Type=="Cells"])
# linear model using Gapdh normalized rescaled to shScr values
m4 <- lm(norm2 ~ Condition, data=exp1.v1[Antibody=="Met" & Type=="Cells"])
prob[1] <- coef(summary(m1))["ConditionshMet", "Pr(>|t|)"]
prob[2] <- coef(summary(m2))["ConditionshMet", "Pr(>|t|)"]
prob[3] <- coef(summary(m3))["ConditionshMet", "Pr(>|t|)"]
prob[4] <- coef(summary(m4))["ConditionshMet", "Pr(>|t|)"]
prob[5] <- t.test(x=exp1.v1[Antibody=="Met" &
Type=="Cells" &
Condition == "shMet", norm3],
mu=1)$p.value
prob6 <- wilcox.test(x=exp1.v1[Antibody=="Met" &
Type=="Cells" &
Condition == "shMet", Value],
y=exp1.v1[Antibody=="Met" &
Type=="Cells" &
Condition == "shScr", Value])
knitr::kable(data.table(Method=c("lm", "none", "norm1", "norm2", "norm3"),
"p-value" = prob), digits = 3,
caption = "P-values for Met")
Method | p-value |
---|---|
lm | 0.191 |
none | 0.098 |
norm1 | 0.128 |
norm2 | 0.079 |
norm3 | 0.014 |
prob <- numeric(5) # p-values for all five ways of analyzing data
# linear model with ref as covariate
m1 <- lm(Value ~ gapdh_ref + Condition, data=exp1.v1[Antibody=="pMet" & Type=="Cells"])
# linear model with no accounting for ref
m2 <- lm(Value ~ Condition, data=exp1.v1[Antibody=="pMet" & Type=="Cells"])
# linear model using Gapdh normalized values
m3 <- lm(norm1 ~ Condition, data=exp1.v1[Antibody=="pMet" & Type=="Cells"])
# linear model using Gapdh normalized rescaled to shScr values
m4 <- lm(norm2 ~ Condition, data=exp1.v1[Antibody=="pMet" & Type=="Cells"])
prob[1] <- coef(summary(m1))["ConditionshMet", "Pr(>|t|)"]
prob[2] <- coef(summary(m2))["ConditionshMet", "Pr(>|t|)"]
prob[3] <- coef(summary(m3))["ConditionshMet", "Pr(>|t|)"]
prob[4] <- coef(summary(m4))["ConditionshMet", "Pr(>|t|)"]
prob[5] <- t.test(x=exp1.v1[Antibody=="pMet" &
Type=="Cells" &
Condition == "shMet", norm3],
mu=1)$p.value
prob6 <- wilcox.test(x=exp1.v1[Antibody=="pMet" &
Type=="Cells" &
Condition == "shMet", Value],
y=exp1.v1[Antibody=="pMet" &
Type=="Cells" &
Condition == "shScr", Value])
knitr::kable(data.table(Method=c("lm", "none", "norm1", "norm2", "norm3"),
"p-value" = prob), digits = 3,
caption = "P-values for pMet")
Method | p-value |
---|---|
lm | 0.254 |
none | 0.208 |
norm1 | 0.227 |
norm2 | 0.000 |
norm3 | 0.003 |
Here I simulate the experiment in Fig 1C of the paper. Effectively, this simulates an experiment with one control level, one treatment level, and a sample size of 4 (per level). Control and treatment levels are adjusted using a reference level (simulating GAPDH). The adjustments are 1) lm (a linear model with \(Gapdh\) as a covariate), 2) norm1 (the control or treatment value divided by the \(Gapdh\) value), 3) norm2 (norm1 rescaled to the control mean), and 4) norm3 (equivalent to norm2 but with the control values set equal to one).
As described above, normalization assumes a correlation of 1.0 between the reference and focal values. Here I compute multiple estimates of the true correlation between the reference values and the conditional response (conditioned on treatment level) in the whole data set and in different subgroups. Because of the small sample size, the true correlation is estimated with large error, so the estimates here are very uncertain. Nevertheless, the different estimates are fairly consistent in suggesting that this correlation is very small. Regardless, keep these values in mind.
r <- numeric(5)
dataset <- rep("", 5)
fit <- lm(Value ~ Condition + Type + Antibody,
data=exp1.v1[Antibody!="Gapdh"])
r[1] <- cor(residuals(fit), exp1.v1[Antibody!="Gapdh", gapdh_ref])
dataset[1] <- "whole dataset"
fit <- lm(Value ~ Condition + Antibody,
data=exp1.v1[Antibody!="Gapdh" & Type=="Cells"])
r[2] <- cor(residuals(fit), exp1.v1[Antibody!="Gapdh" & Type=="Cells", gapdh_ref])
dataset[2] <- "subset of Type=Cells"
fit <- lm(Value ~ Condition + Type, data=exp1.v1[Antibody=="Met"])
r[3] <- cor(residuals(fit), exp1.v1[Antibody=="Met", gapdh_ref])
dataset[3] <- "subset of Antibody=Met"
fit <- lm(Value ~ Condition, data=exp1.v1[Antibody=="Met" & Type=="Cells"])
r[4] <- cor(residuals(fit), exp1.v1[Antibody=="Met" & Type=="Cells", gapdh_ref])
dataset[4] <- "subset of Antibody=Met and Type=Cells"
fit <- lm(Value ~ Condition, data=exp1.v1[Antibody=="pMet"])
r[5] <- cor(residuals(fit), exp1.v1[Antibody=="pMet", gapdh_ref])
dataset[5] <- "subset of Antibody=pMet (all Type=Cells)"
knitr::kable(data.table(Dataset=dataset, Cor=r), digits=3)
Dataset | Cor |
---|---|
whole dataset | 0.027 |
subset of Type=Cells | 0.068 |
subset of Antibody=Met | -0.020 |
subset of Antibody=Met and Type=Cells | -0.153 |
subset of Antibody=pMet (all Type=Cells) | 0.027 |
exp1[Antibody=="Met" & Type=="Cells", .(N=.N), by=.(Antibody, Type, Blot, Condition)]
## Antibody Type Blot Condition N
## 1: Met Cells https://osf.io/e3dny/ shScr 1
## 2: Met Cells https://osf.io/e3dny/ shMet 1
## 3: Met Cells https://osf.io/nra32/ shMet 2
## 4: Met Cells https://osf.io/nra32/ shScr 2
exp1[Antibody=="pMet" & Type=="Cells", .(N=.N), by=.(Antibody, Type, Blot, Condition)]
## Antibody Type Blot Condition N
## 1: pMet Cells https://osf.io/bnxht/ shScr 1
## 2: pMet Cells https://osf.io/bnxht/ shMet 1
## 3: pMet Cells https://osf.io/nra32/ shMet 2
## 4: pMet Cells https://osf.io/nra32/ shScr 2
## 5: pMet Cells https://osf.io/dg4kx/ shScr 1
## 6: pMet Cells https://osf.io/dg4kx/ shMet 1
Met data: one replicate in one blot and two replicates in a second blot (so n=3);
pMet data: one replicate in each of two blots and two replicates in a third blot (so n=4).
The functions below generate a data set with a control, a treatment, and a reference for normalization. norm1, norm2, and norm3 are defined as above. The parameter rho controls the expected correlation between the reference and either the control or treatment values. The explored values are (0, 0.3, 0.6). Note that the empirical estimates of rho are close to zero.
get_fake_data <- function(
n=4, # number of replicates per treatment level
rho=0.5, # correlation between the reference value and that of a control level
s_kappa=1, # effect of treatment on shape (this is multiplicative so 1 = no effect)
s_theta=1, # effect of treatment on scale (this is multiplicative so 1 = no effect)
kappa_0=80, # shape parameter for reference
theta_0=100, # scale parameter for reference
kappa_1=30, # shape parameter for control
theta_1=100 # scale parameter for control
){
kappa_1_i <- rep(c(kappa_1, kappa_1*s_kappa), each=n)
theta_1_i <- rep(c(theta_1, theta_1*s_theta), each=n)
y1 <- rgamma(n*2, shape=kappa_0 - rho*sqrt(kappa_0*kappa_1), scale=1)
y2 <- rgamma(n*2, shape=kappa_1_i - rho*sqrt(kappa_0*kappa_1_i), scale=1)
y3 <- rgamma(n*2, shape=rho*sqrt(kappa_0*kappa_1), scale=1)
fd <- data.table(
treatment=rep(c("cn", "tr"), each=n),
gapdh=theta_0*(y1+y3),
value=theta_1_i*(y2+y3)
)
return(fd)
}
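# A quick check of the gamma-sum construction above (my addition, not part of the
# original analysis): within each treatment level, the correlation between gapdh
# and value should be close to the specified rho when s_kappa = s_theta = 1.
set.seed(1)
fd_check <- get_fake_data(n=10^5, rho=0.6)
fd_check[treatment=="cn", cor(gapdh, value)] # should be approximately 0.6
fd_check[treatment=="tr", cor(gapdh, value)] # also approximately 0.6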
simulate_experiment <- function(
n=4, # number of replicates per treatment level
blot_id=rep("blot1", n), # how to divy up the n samples among blots
rho=0.5, # correlation between the reference value and that of a control level
s_kappa=1, # effect of treatment on shape (this is multiplicative so 1 = no effect)
s_theta=1, # effect of treatment on scale (this is multiplicative so 1 = no effect)
kappa_0=80, # shape parameter for reference
theta_0=100, # scale parameter for reference
kappa_1=30, # shape parameter for control
theta_1=100 # scale parameter for control
){
fd <- get_fake_data(n, rho, s_kappa, s_theta,
kappa_0, theta_0, kappa_1, theta_1)
fd[, blot:=rep(blot_id, 2)]
fd[, norm1:=value/gapdh]
cn_ref_list <- fd[treatment=="cn", .(cn_ref=mean(norm1)), by=blot]
fd <- merge(fd, cn_ref_list, by="blot")
fd[, norm2:=norm1/cn_ref]
fd[, norm3:=ifelse(treatment=="cn", 1, norm2)]
return(fd)
}
iterate_experiment <- function(
n=4, # number of replicates per treatment level
blot_id=rep("blot1", n), # how to divy up the n samples among blots
niter=2000, # number of iterations
rho=0.5, # correlation between the reference value and that of a control level
s_kappa=1, # effect of treatment on shape (this is multiplicative so 1 = no effect)
s_theta=1, # effect of treatment on scale (this is multiplicative so 1 = no effect)
kappa_0=80, # shape parameter for reference
theta_0=100, # scale parameter for reference
kappa_1=30, # shape parameter for control
theta_1=100 # scale parameter for control
){
# Given a western blot with three "treatments": the reference (Gapdh) is the set of values used for
# normalization and the control is the set of values for a control. The treatment values are
# determined by s_kappa and s_theta (1 = no treatment effect).
prob_cols <- c("lm", "norm1", "norm2", "norm3")
prob <- data.table(matrix(-9999, nrow=niter, ncol=length(prob_cols)))
setnames(prob, old=colnames(prob), new=prob_cols)
effect_cols <- c("delta_gapdh", "effect_lm", "effect_norm1", "effect_norm2", "effect_norm3")
effects_dt <- data.table(matrix(-9999, nrow=niter, ncol=length(effect_cols)))
setnames(effects_dt, old=colnames(effects_dt), new=effect_cols)
r <- numeric(niter) # cor between control and gapdh values
se.norm1 <- numeric(niter) # SE of the norm1 treatment effect
set.seed(1) # yes I want to reset this to the same with each combo
for(iter in 1:niter){
fd <- simulate_experiment(
n=n, # number of replicates per treatment level
blot_id=blot_id, # how to divvy up the n samples among blots
rho=rho, # correlation between the reference value and that of a control level
s_kappa=s_kappa, # effect of treatment on shape (this is multiplicative so 1 = no effect)
s_theta=s_theta, # effect of treatment on scale (this is multiplicative so 1 = no effect)
kappa_0=kappa_0, # shape parameter for reference
theta_0=theta_0, # scale parameter for reference
kappa_1=kappa_1, # shape parameter for control
theta_1=theta_1 # scale parameter for control
)
m1 <- lm(value ~ gapdh + treatment, data=fd)
m1.coef <- coef(summary(m1))
prob[iter, lm := m1.coef["treatmenttr", "Pr(>|t|)"]]
m2 <- lm(norm1 ~ treatment, data=fd)
m2.coef <- coef(summary(m2))
prob[iter, norm1 := m2.coef["treatmenttr", "Pr(>|t|)"]]
# prob[iter, norm1 := t.test(fd[treatment=="cn", norm1], fd[treatment=="tr", norm1], var.equal=TRUE)$p.value]
prob[iter, norm2 := t.test(fd[treatment=="cn", norm2], fd[treatment=="tr", norm2], var.equal=TRUE)$p.value]
prob[iter, norm3 := t.test(x=fd[treatment=="tr", norm3], mu=1)$p.value]
effects_dt[iter, delta_gapdh := mean(fd[treatment=="tr", gapdh]) - mean(fd[treatment=="cn", gapdh])]
effects_dt[iter, effect_lm := m1.coef["treatmenttr", "Estimate"]]
effects_dt[iter, effect_norm1 := m2.coef["treatmenttr", "Estimate"]]
effects_dt[iter, effect_norm2 := mean(fd[treatment=="tr", norm2]) - mean(fd[treatment=="cn", norm2])]
effects_dt[iter, effect_norm3 := mean(fd[treatment=="tr", norm3]) - 1]
r[iter] <- cor(fd[treatment=="cn", gapdh], fd[treatment=="cn", value])
se.norm1[iter] <- m2.coef["treatmenttr", "Std. Error"]
}
return(
data.table(prob, effects_dt, cor=r, se_norm1=se.norm1)
)
}
n=10^4 # number of replicates per treatment level
rho=0.5 # correlation between the reference value and that of a control level
s_kappa=1.0 # effect of treatment on shape (this is multiplicative so 1 = no effect)
s_theta=1.0 # effect of treatment on scale (this is multiplicative so 1 = no effect)
kappa_0=80 # shape parameter for reference
theta_0=100 # scale parameter for reference
kappa_1=30 # shape parameter for control
theta_1=100 # scale parameter for control
# expected standardized effect (Cohen's d) of the treatment
(s_kappa*kappa_1*s_theta*theta_1 - kappa_1*theta_1)/(sqrt(kappa_1*theta_1^2))
fd <- get_fake_data(n, rho, s_kappa, s_theta,
kappa_0, theta_0, kappa_1, theta_1)
# quick and dirty Cohen's d from simulated data
kappa_1*theta_1 # expected control mean
s_kappa*kappa_1*s_theta*theta_1 # expected treatment mean
(means_table <- fd[, .(cell_mean=mean(value), cell_sd=sd(value)), by=treatment])
(means_table[treatment=="tr", cell_mean] - means_table[treatment=="cn", cell_mean])/means_table[treatment=="cn", cell_sd]
plot_effects <- function(res){
prob_cols <- c("lm", "norm1", "norm2", "norm3")
gg1 <- qplot(x=res$delta_gapdh/10^3, y=res$effect_lm) +
geom_smooth(method="lm") +
ggtitle("Linear model") +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
gg2 <- qplot(x=res$delta_gapdh/10^3, y=res$effect_norm1) +
geom_smooth(method="lm") +
ggtitle("Norm1") +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
gg3 <- qplot(x=res$delta_gapdh/10^3, y=res$effect_norm2) +
geom_smooth(method="lm") +
ggtitle("Norm2") +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
gg4 <- qplot(x=res$delta_gapdh/10^3, y=res$effect_norm3) +
geom_smooth(method="lm") +
ggtitle("Norm3") +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
gg <- plot_grid(gg1, gg2, gg3, gg4, nrow=2)
return(gg)
}
plot_se <- function(res){
prob_cols <- c("lm", "norm1", "norm2", "norm3")
blot_levels <- unique(res$blots)
i <- 1
gg1 <- qplot(x=res[blots==blot_levels[i], delta_gapdh/10^3], y=res[blots==blot_levels[i], se_norm1]) +
geom_smooth(method="lm") +
ggtitle(paste("blots =", blot_levels[i])) +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
i <- 2
gg2 <- qplot(x=res[blots==blot_levels[i], delta_gapdh/10^3], y=res[blots==blot_levels[i], se_norm1]) +
geom_smooth(method="lm") +
ggtitle(paste("blots =", blot_levels[i])) +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
i <- 3
gg3 <- qplot(x=res[blots==blot_levels[i], delta_gapdh/10^3], y=res[blots==blot_levels[i], se_norm1]) +
geom_smooth(method="lm") +
ggtitle(paste("blots =", blot_levels[i])) +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
i <- 4
gg4 <- qplot(x=res[blots==blot_levels[i], delta_gapdh/10^3], y=res[blots==blot_levels[i], se_norm1]) +
geom_smooth(method="lm") +
ggtitle(paste("blots =", blot_levels[i])) +
xlab(expression(paste(Gapdh[t] - Gapdh[c], "(X 1000)"))) +
ylab("Effect") +
theme_minimal() +
NULL
gg <- plot_grid(gg1, gg2, gg3, gg4, nrow=2)
return(gg)
}
plot_prob_t1 <- function(res){
res[, t1.lm:=ifelse(lm <= 0.05, 1, 0)]
res[, t1.norm1:=ifelse(norm1 <= 0.05, 1, 0)]
res[, t1.norm2:=ifelse(norm2 <= 0.05, 1, 0)]
res[, t1.norm3:=ifelse(norm3 <= 0.05, 1, 0)]
gg1 <- ggplot(data=res, aes(x=abs(delta_gapdh), y=t1.lm)) +
geom_smooth(method='glm', method.args=list(family='binomial')) +
ylab("Prob(Type I): linear model") +
xlab(expression(paste("|",Gapdh[t] - Gapdh[c],"|"))) +
theme_minimal()
gg2 <- ggplot(data=res, aes(x=abs(delta_gapdh), y=t1.norm1)) +
geom_smooth(method='glm', method.args=list(family='binomial')) +
ylab("Prob(Type I): norm1") +
xlab(expression(paste("|",Gapdh[t] - Gapdh[c],"|"))) +
theme_minimal()
gg3 <- ggplot(data=res, aes(x=abs(delta_gapdh), y=t1.norm2)) +
geom_smooth(method='glm', method.args=list(family='binomial')) +
ylab("Prob(Type I): norm2") +
xlab(expression(paste("|",Gapdh[t] - Gapdh[c],"|"))) +
theme_minimal()
gg4 <- ggplot(data=res, aes(x=abs(delta_gapdh), y=t1.norm3)) +
geom_smooth(method='glm', method.args=list(family='binomial')) +
ylab("Prob(Type I): norm3") +
xlab(expression(paste("|",Gapdh[t] - Gapdh[c],"|"))) +
theme_minimal()
gg <- plot_grid(gg1, gg2, gg3, gg4, nrow=2)
return(gg)
}
plot_prob_t1_2 <- function(res, ycol, two_d=FALSE){
res[, t1:=ifelse(get(ycol) <= 0.05, 1, 0)]
gg <- ggplot(data=res, aes(x=abs(delta_gapdh), y=t1)) +
geom_smooth(method='glm', method.args=list(family='binomial')) +
ylab(paste("Prob(Type I):", ycol)) +
xlab(expression(paste("|",Gapdh[t] - Gapdh[c],"|"))) +
geom_hline(aes(yintercept = 0.05), color="red") +
theme_minimal() +
NULL
if(two_d==TRUE){
gg <- gg + facet_grid(blots ~ rho, labeller=label_both)
}else{
gg <- gg + facet_grid(. ~ rho, labeller=label_both)
}
return(gg)
}
get_type_1_table <- function(res){
prob_cols <- c("lm","norm1", "norm2","norm3")
niter <- nrow(res)/length(unique(res[, blots]))/length(unique(res[, rho]))
res_long <- melt(res, id.vars = c("blots", "rho"), measure.vars = prob_cols, variable.name = "method", value.name = "p.value")
type_1 <- res_long[, .(Type_I=sum(p.value < 0.05)/niter), by=.(blots, rho, method)]
type_1_wide <- dcast(type_1, blots ~ method, value.var = "Type_I")
return(type_1_wide)
}
The simulation computes type I error at all combinations of \(blots\) and \(rho\). The parameter \(blots\) is the number of unique blots. The number of replicates per blot is n/blots, so with blots=1 there is one blot with four replicates. With n=4 and blots=3, there are 1, 1, and 2 replicates in the three blots, which is the case for pMet in Fig 1C. The parameter \(rho\) is the expected correlation between the reference level and either the control or treatment level. Again, the empirical estimates of rho are close to zero. Effects and p-values are computed for 10,000 iterations of each combination of \(blots\) and \(rho\). Simulations were run with a sample size of \(n=4\) (that of the pMet data in the study; the Met data had \(n=3\)) and with \(n=10\).
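For example (my illustration of the blot_id argument, not code from the simulation driver below), the pMet structure in Fig 1C, with n=4 split as 1, 1, and 2 replicates across three blots, corresponds to a blot_id vector like this:
# blot_id for n = 4 replicates spread over 3 blots (1, 1, and 2 replicates)
blot_id_pmet <- rep(paste0("blot", 1:3), times=c(1, 1, 2))
blot_id_pmet # "blot1" "blot2" "blot3" "blot3"
# this vector is passed as the blot_id argument to simulate_experiment()/iterate_experiment()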
# parameters
# n=4, # number of replicates per treatment level
# blots=blots_i, # number of blots. Reps/blot = n/blots
# niter=1000, # number of iterations
# rho=0, # correlation between the reference value and that of a control level
# s_kappa=1, # effect of multiplicative treatment on shape
# s_theta=1, # effect of multiplicative treatment on scale
# kappa_0=80, # shape parameter for reference
# theta_0=100, # scale parameter for reference
# kappa_1=30, # shape parameter for control
# theta_1=100 # scale parameter for control
simulate_it <- FALSE
which_sim <- "sim1"
if(which_sim=="sim1"){
file_path <- here(output_path, "normalization_II_sim1.rds")
n_iter <- 10000
rho_vec <- c(0, 0.3, 0.6) # cor between ref and either cn or tr
n <- 4
blots_vec <- c(1, 2, 3, 4)
reps_per_blot_mat <- t(matrix(c(c(rep(4, 1), rep(NA, 3)),
c(rep(2, 2), rep(NA, 2)),
c(c(1, 1, 2), rep(NA, 1)),
rep(1, 4)), ncol=4))
}
if(which_sim=="sim2"){
file_path <- here(output_path, "normalization_II_sim2.rds")
n_iter <- 10000
rho_vec <- c(0, 0.3, 0.6) # cor between ref and either cn or tr
n <- 10
blots_vec <- c(1, 2, 5, 10) # number of blots
reps_per_blot_mat <- t(matrix(c(c(rep(10, 1), rep(NA, 9)),
c(rep(5, 2), rep(NA, 8)),
c(rep(2, 5), rep(NA, 5)),
rep(1, 10)), ncol=4))
}
if(simulate_it==TRUE){
sim_combo <- expand.grid(blots=blots_vec, rho=rho_vec)
blot_id_mat <- data.frame(matrix(nrow=nrow(reps_per_blot_mat), ncol=n))
for(i in 1:nrow(reps_per_blot_mat)){
reps_per_blot <- na.omit(reps_per_blot_mat[i,])
blot_names <- paste0("blot", seq_along(reps_per_blot))
blot_id_mat[i,] <- rep(blot_names, times=reps_per_blot)
}
blot_id_mat$blots <- blots_vec
sim_combo <- data.table(merge(sim_combo, blot_id_mat, by="blots"))
blot_id_cols <- paste0("X", 1:n)
res <- data.table(NULL)
for(combo in 1:nrow(sim_combo)){
blots_i <- sim_combo[combo, blots]
blot_id_i <- unlist(sim_combo[combo, .SD, .SDcols=blot_id_cols])
rho_i <- sim_combo[combo, rho]
res <- rbind(res,
data.table(blots=blots_i,
rho=rho_i,
iterate_experiment(
n=n,
blot_id=blot_id_i,
niter=n_iter,
rho=rho_i,
s_kappa=1,
s_theta=1,
kappa_0=80,
theta_0=100,
kappa_1=30,
theta_1=100
)))
}
saveRDS(res, file = file_path)
}else{
res <- readRDS(file = file_path)
}
Again, the marginal type I error is just the normal type I error.
file_path <- here(output_path, "normalization_II_sim1.rds")
res <- readRDS(file = file_path)
type1_a <- get_type_1_table(res[rho==0])
type1_b <- get_type_1_table(res[rho==0.3])
knitr::kable(type1_a, full_width=FALSE,
digits=3,
caption="A. Marginal Type I error. rho=0.0")
blots | lm | norm1 | norm2 | norm3 |
---|---|---|---|---|
1 | 0.052 | 0.049 | 0.049 | 0.117 |
2 | 0.052 | 0.049 | 0.060 | 0.085 |
3 | 0.052 | 0.049 | 0.074 | 0.064 |
4 | 0.052 | 0.049 | 0.090 | 0.049 |
knitr::kable(type1_b, full_width=FALSE,
digits=3,
caption="B. Marginal Type I error. rho=0.3")
blots | lm | norm1 | norm2 | norm3 |
---|---|---|---|---|
1 | 0.046 | 0.049 | 0.049 | 0.108 |
2 | 0.046 | 0.049 | 0.059 | 0.081 |
3 | 0.046 | 0.049 | 0.071 | 0.062 |
4 | 0.046 | 0.049 | 0.089 | 0.050 |
The correlation parameter \(rho\) doesn't have much of an effect. Both lm and norm1 have nominal marginal type I error regardless of rho or the number of blots. The type I error of norm2 is increasingly inflated as the number of blots increases and the number of replicates per blot decreases. The reverse is the case for norm3 – its type I error is most inflated when all replicates are on a single blot. (A plausible explanation: with one replicate per condition per blot, the norm2 control values are all exactly 1, so the pooled variance in the two-sample t-test is too small; with all replicates on a single blot, the norm3 treatment values share a single control denominator, so the one-sample t-test against 1 underestimates the uncertainty.) Note that the type I error for lm and norm1 is not a function of how replicates are structured.
Linear Model
plot_prob_t1_2(res, ycol="lm", two_d=TRUE)
Norm1
plot_prob_t1_2(res, ycol="norm1", two_d=TRUE)
Norm2
plot_prob_t1_2(res, ycol="norm2", two_d=TRUE)
Norm3
plot_prob_t1_2(res, ycol="norm3", two_d=TRUE)
The conditional Type I error plots show the estimated probability of a Type I error from the simulation and illustrate the two different sources of Type I error: conditional inflation due to regression to the mean, seen as an increase in the probability of a Type I error with the magnitude of \(\Delta Gapdh\) in the normalized analyses, and marginal inflation introduced by the stage 2 and 3 renormalizations (norm2 and norm3), which depends on how the replicates are distributed among blots.
Combined, the simulation shows that normalizing by GAPDH (stage 1) can inflate the conditional Type I error when the true correlation between GAPDH and the response is less than 1.0, that the stage 2 and 3 renormalizations inflate the marginal Type I error in ways that depend on the distribution of replicates among blots, and that the linear model with \(Gapdh\) as a covariate maintains nominal marginal Type I error throughout.
file_path <- here(output_path, "normalization_II_sim2.rds")
res <- readRDS(file = file_path)
type1_a <- get_type_1_table(res[rho==0])
type1_b <- get_type_1_table(res[rho==0.3])
knitr::kable(type1_a, full_width=FALSE,
digits=3,
caption="A. Marginal Type I error. rho=0.0")
blots | lm | norm1 | norm2 | norm3 |
---|---|---|---|---|
1 | 0.05 | 0.05 | 0.050 | 0.151 |
2 | 0.05 | 0.05 | 0.053 | 0.133 |
5 | 0.05 | 0.05 | 0.056 | 0.093 |
10 | 0.05 | 0.05 | 0.068 | 0.051 |
knitr::kable(type1_b, full_width=FALSE,
digits=3,
caption="B. Marginal Type I error. rho=0.3")
blots | lm | norm1 | norm2 | norm3 |
---|---|---|---|---|
1 | 0.045 | 0.046 | 0.046 | 0.141 |
2 | 0.045 | 0.046 | 0.048 | 0.125 |
5 | 0.045 | 0.046 | 0.054 | 0.086 |
10 | 0.045 | 0.046 | 0.066 | 0.049 |
Linear model
plot_prob_t1_2(res, ycol="lm", two_d=TRUE)
norm1
plot_prob_t1_2(res, ycol="norm1", two_d=TRUE)
norm2
plot_prob_t1_2(res, ycol="norm2", two_d=TRUE)
norm3
plot_prob_t1_2(res, ycol="norm3", two_d=TRUE)
The general pattern for \(n=10\) is the same as for \(n=4\) and is summarized above in Summary of simulation results for n=4.
I wouldn't think this is news, but I'm not familiar with the literature. Google Scholaring found mostly normalization in microarray/RNA-seq type stuff, and much of this is concerned with different issues. There is abundant literature on normalizing for body weight in organismal physiology and on adjusting for baseline in pre-post designs, but there doesn't seem to be much acknowledgment within experimental biology of regression to the mean due to normalizing measures like band intensity. I did find this:
Janušonis, S., 2009. Comparing two small samples with an unstable, treatment-independent baseline. Journal of Neuroscience Methods, 179(2), pp. 173–178.
which doesn’t discuss regression to the mean and inflated conditional type I error even though it cites some of this literature.