Motivator: A Twitter comment “Isn’t the implication that the large effect size is a direct byproduct of the lack of power? i.e. that if the the study had more power, the effect size would have been found to be smaller.”¹ ²

A thought: our belief in the magnitude of an observed effect should be based on our priors, which, hopefully, are formed from good mechanistic models and not sample size.³

1. If we observe a large effect but the sample size is small, then should we believe that the effect is strongly inflated?
2. If we had measured a larger sample, would the effect be smaller?

But… maybe sample size should influence our prior, because the expected magnitude of the estimated effect is bigger than the true effect if the true effect is near zero (explored a bit here). This is because, if an effect is near zero, estimates will vary on both sides of zero, and the absolute value of most of these estimates will be bigger than the absolute value of the true effect. But at what effect size should we worry about this?
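A minimal sketch of that argument, assuming the estimate is normally distributed around the true effect with a known standard error. The values of `true_effect` and `se` are illustrative assumptions, not taken from the simulation below.

```r
# Illustrative values (assumptions): a near-zero true effect and a
# standard error that is large relative to it
set.seed(1) # arbitrary seed, for reproducibility only
true_effect <- 0.05
se <- 0.3
estimates <- rnorm(10^5, mean = true_effect, sd = se)

# Fraction of estimates whose magnitude exceeds the true magnitude
frac_bigger <- mean(abs(estimates) > abs(true_effect))

# The same fraction from the normal distribution: half of the estimates
# exceed true_effect, plus those that fall below -true_effect
theory <- 0.5 + pnorm(-2 * true_effect / se)

# Typical (median) magnitude of an estimate
median_abs <- median(abs(estimates))

c(simulated = frac_bigger, theory = theory, median_abs = median_abs)
```

Shrinking `se` relative to `true_effect` (a bigger sample or a bigger effect) pulls the fraction back toward one half and the median magnitude back toward the true effect.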

Set up

library(ggplot2)
library(ggpubr)

## Loading required package: magrittr

library(data.table)

The simulation

Simulate an experiment with two treatment levels (“control” and “treated”), with standardized (\(\frac{\delta}{\sigma}\)) effect sizes of 0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1, and 2 and sample sizes of 100, 20, and 10 per treatment level. Cohen considered 0.8 a “large” standardized effect but I’ll leave what is large up to you. Regardless, it’s worth comparing the results here to observed effects.

n <- 100 # per treatment level. power will be a function of effect size
b_array <- c(0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 2) # standardized effects
niter <- 10^4 # simulated experiments per effect size
res_table <- data.table(NULL)
power_table <- data.table(NULL)
for(b1 in b_array){
  # each column is one simulated experiment; sd = 1, so b1 is standardized
  y1 <- matrix(rnorm(n*niter), nrow=n)      # control
  y2 <- matrix(rnorm(n*niter), nrow=n) + b1 # treated
  # estimated effects; n = 20 and n = 10 reuse the first rows of the same data
  d100 <- colMeans(y2) - colMeans(y1)
  d20 <- colMeans(y2[1:20,]) - colMeans(y1[1:20,])
  d10 <- colMeans(y2[1:10,]) - colMeans(y1[1:10,])
  res_table <- rbind(res_table, data.table(b=b1, d100=d100, d20=d20, d10=d10))
  # theoretical power of a two-sample t-test at each sample size
  power_table <- rbind(power_table, data.table(
    b=b1,
    "power (n=100)"=power.t.test(n=100, delta=b1, sd=1)$power,
    "power (n=20)"=power.t.test(n=20, delta=b1, sd=1)$power,
    "power (n=10)"=power.t.test(n=10, delta=b1, sd=1)$power
  ))
}

Power for each simulation combination

knitr::kable(power_table, digits=2)

b	power (n=100)	power (n=20)	power (n=10)
0.05	0.05	0.04	0.03
0.10	0.10	0.05	0.04
0.20	0.29	0.09	0.06
0.30	0.56	0.15	0.09
0.50	0.94	0.34	0.18
0.80	1.00	0.69	0.39
1.00	1.00	0.87	0.56
2.00	1.00	1.00	0.99
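The power values above come from `power.t.test`. As a rough cross-check, the sketch below estimates power by brute force for one cell of the table (n = 20, b = 0.8). It assumes a two-sample Student t-test with equal variances and alpha = 0.05; the helper `empirical_power` and its defaults are hypothetical and not part of the original analysis. Note that `power.t.test` defaults to `strict = FALSE`, which counts only rejections in the direction of the true effect, so for tiny effects it reports values below 0.05.

```r
set.seed(1) # arbitrary seed, for reproducibility only

# Hypothetical helper: fraction of simulated experiments with p < alpha
empirical_power <- function(n, delta, niter = 2000, alpha = 0.05) {
  p <- replicate(niter, {
    control <- rnorm(n)
    treated <- rnorm(n) + delta
    t.test(treated, control, var.equal = TRUE)$p.value
  })
  mean(p < alpha)
}

sim <- empirical_power(n = 20, delta = 0.8)
theory <- power.t.test(n = 20, delta = 0.8, sd = 1)$power
c(simulated = sim, power.t.test = theory)
```

With only 2000 iterations the simulated value carries Monte Carlo error of roughly one percentage point, so agreement to two decimals should not be expected.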

Absolute median effects

knitr::kable(res_table[, .("median(|b|) (n=100)"=median(abs(d100)),
              "median(|b|) (n=20)"=median(abs(d20)),
              "median(|b|) (n=10)"=median(abs(d10))
              ), by=b], digits=c(2, 3, 3, 3))

b	median(|b|) (n=100)	median(|b|) (n=20)	median(|b|) (n=10)
0.05	0.100	0.212	0.302
0.10	0.121	0.224	0.308
0.20	0.197	0.260	0.332
0.30	0.299	0.318	0.370
0.50	0.501	0.502	0.515
0.80	0.799	0.796	0.792
1.00	1.000	0.993	0.995
2.00	2.001	2.004	2.002

Inflation factors

knitr::kable(res_table[, .(
              "IF (n=100)" = median(abs(d100))/b,
              "IF (n=20)" = median(abs(d20))/b,
              "IF (n=10)" = median(abs(d10))/b
              ), by=b], digits=c(2, 1, 1, 1))

b	IF (n=100)	IF (n=20)	IF (n=10)
0.05	2.0	4.2	6.0
0.10	1.2	2.2	3.1
0.20	1.0	1.3	1.7
0.30	1.0	1.1	1.2
0.50	1.0	1.0	1.0
0.80	1.0	1.0	1.0
1.00	1.0	1.0	1.0
2.00	1.0	1.0	1.0
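These inflation factors can also be approximated without simulation. The sketch below assumes the estimated difference is exactly normal with mean b and standard error \(\sqrt{2/n}\) (true for this simulation, where \(\sigma = 1\) in both groups), and solves numerically for the median of its absolute value. The helper names `median_abs_normal` and `se_diff` are hypothetical.

```r
# Median of |d| when d ~ Normal(b, se): solve P(|d| <= m) = 0.5 for m
median_abs_normal <- function(b, se) {
  f <- function(m) pnorm(m, mean = b, sd = se) - pnorm(-m, mean = b, sd = se) - 0.5
  uniroot(f, lower = 0, upper = abs(b) + 10 * se, tol = 1e-10)$root
}

# Standard error of a difference in means with sd = 1 and n per group
se_diff <- function(n) sqrt(2 / n)

b_array <- c(0.05, 0.1, 0.2, 0.3, 0.5, 0.8, 1, 2)
n_array <- c(100, 20, 10)

# Rows are effect sizes, columns are sample sizes
inflation <- sapply(n_array, function(n) {
  sapply(b_array, function(b) median_abs_normal(b, se_diff(n)) / b)
})
dimnames(inflation) <- list(b = b_array, n = n_array)
round(inflation, 1)
```

Because the calculation depends only on the ratio of b to the standard error, the same function can be used to ask how small an effect has to be, at a given sample size, before the median estimate is noticeably inflated.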

Directly answering question #2

Notice that if power is above about 0.2, the median absolute effect is not inflated. That is, a study would have to be wicked underpowered for there to be an expected inflated effect size. This is an indirect answer to question no. 2. A more direct answer is explored by computing the log10 ratio of absolute effects between sample size levels for each run of the simulation.

res_table[, d100.d20:=log10(abs(d20)/abs(d100))]
res_table[, d100.d10:=log10(abs(d10)/abs(d100))]
res_table[, d20.d10:=log10(abs(d10)/abs(d20))]
res_melt <- melt(res_table[, .SD, .SDcols=c("b", "d100.d20", "d100.d10", "d20.d10")], id.vars="b", variable.name="comparison", value.name="contrast")
res_melt[, b:=factor(b)]
pd <- position_dodge(0.8)
ggplot(data=res_melt, aes(x=b, y=contrast, fill=comparison)) +
  geom_boxplot(position=pd, outlier.shape=NA) +
  coord_cartesian(ylim=c(-1.25, 2))

If the true effect is really small (0.05) then a smaller sample will often estimate a larger effect (just less than 75% of the time when decreasing \(n\) from 100 to 20). When the true effect is about 0.5 or higher, decreasing sample size is no more likely to estimate a bigger effect than increasing sample size.
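The proportions described above can be computed directly rather than read off the boxplots. The following is a self-contained sketch that repeats the nested sampling of the simulation (the n = 20 estimate uses the first 20 rows of the n = 100 data) for two effect sizes; the helper `prop_bigger` is hypothetical and the seed is arbitrary.

```r
set.seed(1) # arbitrary seed, for reproducibility only
n <- 100
niter <- 10^4

# Fraction of simulated experiments in which the first 20 observations per
# group give a larger |difference| than all 100 observations per group
prop_bigger <- function(b1) {
  y1 <- matrix(rnorm(n * niter), nrow = n)      # control
  y2 <- matrix(rnorm(n * niter), nrow = n) + b1 # treated
  d100 <- colMeans(y2) - colMeans(y1)
  d20 <- colMeans(y2[1:20, ]) - colMeans(y1[1:20, ])
  mean(abs(d20) > abs(d100))
}

p_small <- prop_bigger(0.05)
p_mid <- prop_bigger(0.5)
c(b_0.05 = p_small, b_0.5 = p_mid)
```

With the full `res_table` from the simulation in memory, the equivalent one-liner would be along the lines of `res_table[, .(p = mean(abs(d20) > abs(d100))), by = b]`.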

¹ The original question, and the motivating tweet, raise the question of what a “large” effect is. There is large in the absolute sense, which would require subject-level expertise to identify, and large relative to noise.
² Note that the original post was not about the statistical significance filter but about the ethics of an RCT in which the observed effect was “large” but there was not enough power to get a statistically significant p-value.
³ At least if we are using an experiment to estimate an effect. If we are trying to estimate multiple effects, the bigger observed effects tend to be inflated and the smaller observed effects tend to be deflated.

Should we be skeptical of a "large" effect size if p > 0.05?

R doodles. Some ecology. Some physiology. Much fake data.
