Copyright © Jeffrey A. Walker
keywords: q-q plot, model checking
knitr::opts_chunk$set(echo = TRUE,
message=FALSE,
warning = FALSE)
# import and wrangling
library(here)
library(janitor)
library(readxl)
library(data.table)
# analysis
library(car) # qqPlot
library(qqplotr) # qq plot functions for ggplot
library(MASS) # rnegbin
# graphics and output
library(ggpubr) # qq plot functions
library(ggpubfigs) # palette
library(ggthemes) # themes and palettes including colorblind
library(ggforce)
library(cowplot)
here <- here::here
data_folder <- "content/data"
output_folder <- "content/output"
# Okabe & Ito palette
ito_seven <- friendly_pal("ito_seven") # ggpubfigs
pal_okabe_ito <- colorblind_pal()(8)[2:8] # ggthemes
Warning - This is a long, exploratory post on Q-Q plots motivated by the specific data set analyzed below, and the code follows my stream of thinking this through. I have not gone back through to economize on length. So yeah, some repeated code I've turned into functions and other repeated code is simply repeated.
This post is not about how to interpret a Q-Q plot but about which Q-Q plot to interpret.
huh?
- base R, ggpubr, and ggplot2 compute the Q-Q line using the "quartiles" method – the line passes through the 0.25 and 0.75 quantile pairs, so it is a function of only 2 of the n points.
- qqPlot from the car package computes a "robust" Q-Q line using a robust regression through all of the points. At least, the default qqPlot on an lm object does this. If a vector of responses or residuals is passed in, qqPlot defaults to the quartiles line. Regardless, a user can specify either.
- ggpubr and ggplot2 (and base R?) compute a confidence band around the quartiles line using a parametric standard error.
- qqPlot uses a parametric bootstrap to compute the confidence band if an lm object is passed to the function. If a vector is passed, qqPlot computes the band from a parametric SE. Again, a user can specify either.
To see this with qqPlot:
set.seed(1)
qqPlot(rnorm(20), id = FALSE) # quartiles
set.seed(1)
qqPlot(lm(rnorm(20) ~ 1), id = FALSE) # robust regression
The question pursued here is, which should a user use? The short answer is, the robust line with the parametric bootstrap confidence band from qqPlot.
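For reference, here is a minimal sketch of that recommendation (my own aside, not part of the analysis below; it assumes any fitted lm object, here a throwaway intercept-only fit). These arguments are the documented defaults for qqPlot's lm method, written out explicitly:
# robust Q-Q line + parametric bootstrap envelope from car::qqPlot
set.seed(1)
fit <- lm(rnorm(20) ~ 1)  # any fitted lm object
qqPlot(fit,
       line = "robust",   # robust-regression Q-Q line
       simulate = TRUE,   # parametric bootstrap confidence envelope
       id = FALSE)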
data_from <- "Loss of p53 triggers WNT-dependent systemic inflammation to drive breast cancer metastasis"
file_name <- "41586_2019_1450_MOESM3_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)
treatment_levels <- c("Trp53+/+", "Trp53-/-")
fig1f <- read_excel(file_path,
sheet = "Fig. 1f",
range = "A2:B23") %>%
data.table() %>%
melt(measure.vars = treatment_levels,
variable.name = "treatment",
value.name = "il1beta")
fig1f[, treatment := factor(treatment, treatment_levels)]
# be careful with missing data: it can create a mismatch between row id and residual unless na.action is specified in lm (see the aside below)
# head(fig1f)
m1 <- lm(il1beta ~ treatment, data = fig1f)
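# (aside, not the author's code) if il1beta contained NAs, adding
# na.action = na.exclude would keep residuals() padded and aligned with fig1f's rows:
# m1 <- lm(il1beta ~ treatment, data = fig1f, na.action = na.exclude)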
# residuals
m1_res <- residuals(m1)
# studentized residuals
sigma_m1_res <- sd(m1_res) # will use this again
m1_res_st <- m1_res/sigma_m1_res
n <- length(m1_res)
# base R plot
qqnorm(m1_res)
qqline(m1_res)
# ggpubr plot
ggqqplot(data = data.table(m1_res = m1_res),
x = "m1_res")
## car qqPlot
# car plot
qqPlot(m1,
simulate = FALSE,
id = FALSE)
Q-Q plots generated by base R, ggpubr::ggqqplot, and car::qqPlot are shown. The base R and ggqqplot functions generate the same line. ggplot2::geom_qq_line also generates this line (not shown, but try it!). The default qqPlot line differs from these three. This default is line = "robust". If this argument is changed to line = "quartiles", the qqPlot line is the same as the base R/ggqqplot line. The difference between "robust" and base R/ggqqplot/"quartiles" is pretty easy to see with these data. Look at points 2-4 starting from the left. In the base R/ggqqplot plots, point 2 is slightly above the line, point 3 is slightly below the line, and point 4 is above the line. In the qqPlot, all three points are above the line. Even more striking is the confidence interval. In the ggqqplot, the upper seven (rightmost) points are well outside the interval – we would conclude that a sample from a Normal distribution would be unlikely to generate this pattern of quantiles, and we should consider modeling with a glm or a transformation, etc. In the qqPlot, these seven points are within or just outside the interval – we would conclude that a sample from a Normal would produce something like these quantiles often enough not to worry about modeling with a glm or a transformation or whatever.
Two different interpretations, two different "decisions". Which decision should we decide on? I'll return to that. But first, some sleuthing.
Being curious about the parameters of these lines, I went to the source – the help page for qqPlot. The text for the argument line is:
“quartiles” to pass a line through the quartile-pairs, or “robust” for a robust-regression line; the latter uses the rlm function in the MASS package.
Okay, so let's make three lines:
- parametric – the slope is the standard deviation of the residuals and the intercept is their mean.
- quartiles – the line passes through the 0.25 and 0.75 quantile pairs (the base R/ggqqplot line).
- robust – a robust regression of the sample quantiles on the theoretical quantiles using MASS::rlm. This function has several arguments; I used the defaults.
In addition to these, I used a parametric bootstrap to create my own 95% confidence band of the sample quantiles. For this, I sampled \(n\) points from a Normal distribution with the same standard deviation as the residuals and computed the sample quantiles. I repeated this 1000 times to generate the distribution of expected sample quantiles at each of the \(n\) theoretical quantiles. The 95% CI was computed as the 2.5th and 97.5th percentiles of each of these distributions.
normal_qq <- ppoints(n) %>%
qnorm()
sample_qq <- m1_res[order(m1_res)]
# mean + sd
parametric_slope <- sd(sample_qq)
parametric_intercept <- mean(sample_qq)
# quartiles
m1_quartiles <- quantile(m1_res, c(.25, .75))
qnorm_quartiles <- qnorm( c(.25, .75))
m1_diff <- m1_quartiles[2] - m1_quartiles[1]
qnorm_diff <- qnorm_quartiles[2] - qnorm_quartiles[1] # = 1.349
quartile_slope <- m1_diff/qnorm_diff
quartile_intercept <- median(m1_quartiles) # midpoint of the two sample quartiles
# robust uses MASS::rlm (default arguments?)
qq_observed <- data.table(normal_qq = normal_qq,
sample_qq = sample_qq)
m2 <- rlm(sample_qq ~ normal_qq, data = qq_observed)
robust_intercept <- coef(m2)[1]
robust_slope <- coef(m2)[2]
# parametric bootstrap ribbon: simulate n Normal residuals with the residual SD, center, sort; repeat n_iter times
n_iter <- 1000
resample_qq <- numeric(n_iter*n)
inc <- 1:n
for(iter in 1:n_iter){
y <- rnorm(n, sd = sigma_m1_res)
y_res <- y - mean(y)
resample_qq[inc] <- y_res[order(y_res)]
inc <- inc + n
}
qq_sim <- data.table(normal_qq = normal_qq,
resample_qq = resample_qq)
qq_ci <- qq_sim[, .(median = median(resample_qq),
lower = quantile(resample_qq, 0.025),
upper = quantile(resample_qq, 0.975)),
by = normal_qq]
ggplot(data = qq_observed,
aes(x = normal_qq, y = sample_qq)) +
# ribbon
geom_ribbon(data = qq_ci,
aes(ymin = lower,
ymax = upper,
y = median,
fill = "band"),
fill = "gray",
alpha = 0.6) +
# draw points
geom_point() +
# ggplot's qq line
geom_qq_line(aes(sample = sample_qq,
color = "geom_qq_line"),
show.legend=TRUE,
size = 0.75) +
# parametric
geom_abline(aes(intercept = parametric_intercept,
slope = parametric_slope,
color = "parametric"),
show.legend=TRUE,
size = 0.75) +
# quartile
geom_abline(aes(intercept = quartile_intercept,
slope = quartile_slope,
color = "quartile"),
show.legend=TRUE,
size = 0.75,
linetype = "dashed") +
# robust
geom_abline(aes(intercept = robust_intercept,
slope = robust_slope,
color = "robust"),
show.legend = TRUE,
size = 0.75) +
scale_color_manual(values = pal_okabe_ito[c(1:2,5:6)]) +
theme_minimal_grid() +
NULL
Huh! The quartiles line, which is used by base R, ggpubr, and ggplot2, falls well outside the upper half of the 95% confidence band of sample quantiles sampled from a normal distribution. The parametric and robust lines slice more through the middle of the band, with the parametric nearly bisecting the band.
In sleuthing out the parameters defining the lines in the different Q-Q plot functions, in addition to the qqPlot help page, I discovered Tristan Mahr's excellent blog post Q-Q Plots and Worm Plots from Scratch.
What confused me about Tristan's results was the equivalence between the car::qqPlot line and a "robust" line that Tristan built by setting the robust slope to the interquartile range divided by 1.349. This raised one minor and one major question.
Minor: where does the 1.349 come from? I discovered this somewhat accidentally by looking at the intermediate values in my computation. The 1.349 comes from the denominator of the slope, computed from the theoretical quartiles:
qnorm_quartiles <- qnorm( c(.25, .75))
(qnorm_diff <- qnorm_quartiles[2] - qnorm_quartiles[1])
## [1] 1.34898
Major: Tristan's "robust" computation is my "quartiles" computation, which I got from the help page of car::qqPlot. Tristan calls the quartiles line "robust" because of this quote from John Fox, the author of qqPlot and the book Applied Regression Analysis and Generalized Linear Models:
We can alternatively use the median as a robust estimator of [the mean] and the interquartile range / 1.349 as a robust estimator of [the standard deviation]. (The more conventional estimates [of the sample mean and SD] will not work well when the data are substantially non-normal.) [p. 39]
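As a quick numerical aside (mine, not Fox's or Tristan's): for Normal data the interquartile range is about 1.349 standard deviations, so IQR/1.349 recovers the SD.
# IQR/1.349 approximates the SD for a large Normal sample
set.seed(1)
x <- rnorm(1e5)
c(sd = sd(x), iqr_over_1.349 = IQR(x) / 1.349)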
In my example, the qqPlot default is equivalent to my line computed using robust regression (MASS::rlm
), which I got from the qqPlot help page. In Tristan’s code, the qqPlot default is equivalent to his line computed using the sample quartiles. How could these both be TRUE?
The answer took a bit more sleuthing. In my qqPlot code, I used the S3 method for class 'lm' and passed the fitted linear model from lm. Tristan used the default S3 method and passed the vector of samples. Curiously, the default line in qqPlot differs between these two methods, and this is in the help page. Compare:
qqPlot(m1, # uses line = "robust" by default
simulate = FALSE,
id = FALSE)
qqPlot(m1_res, # uses line = "quartiles" by default
id = FALSE)
Regardless, the line computed in Tristan’s code is not the “robust” line but the “quartiles” line, despite Fox using the word “robust” in the explanation of the quartiles line.
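To convince myself, here is a quick check (my own aside, using the m1_res residuals computed above): the slope of Tristan's "robust" line, the interquartile range divided by 1.349, equals the slope of the "quartiles" line, the rise over run through the 0.25 and 0.75 quantile pairs, up to the rounding of 1.349.
# both slopes come from the same pair of quantiles
iqr_slope      <- diff(quantile(m1_res, c(0.25, 0.75))) / 1.349
quartile_slope <- diff(quantile(m1_res, c(0.25, 0.75))) /
  diff(qnorm(c(0.25, 0.75)))
c(iqr_slope, quartile_slope) # equal up to the rounding of 1.349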
# https://www.tjmahr.com/quantile-quantile-plots-from-scratch/
se_z <- function(z, n) {
sqrt(pnorm(z) * (1 - pnorm(z)) / n) / dnorm(z)
}
compare_qq <- function(fake_y,
add_boot_ci = TRUE,
add_quartiles_ci = FALSE,
studentized = FALSE,
boot_iter = 1000,
plot_it = TRUE){
n <- length(fake_y)
m1 <- lm(fake_y ~ 1)
m1_res <- residuals(m1)
if(studentized == TRUE){m1_res <- m1_res/sd(m1_res)}
sigma_m1_res <- sd(m1_res)
normal_qq <- ppoints(n) %>%
qnorm()
sample_qq <- m1_res[order(m1_res)]
# mean + sd
parametric_slope <- sd(sample_qq)
parametric_intercept <- mean(sample_qq)
# quartiles
m1_quartiles <- quantile(m1_res, c(.25, .75))
qnorm_quartiles <- qnorm( c(.25, .75))
m1_diff <- m1_quartiles[2] - m1_quartiles[1]
qnorm_diff <- qnorm_quartiles[2] - qnorm_quartiles[1] # = 1.349
quartile_slope <- m1_diff/qnorm_diff
quartile_intercept <- median(m1_quartiles)
# robust uses MASS::rlm (default arguments?)
qq_observed <- data.table(normal_qq = normal_qq,
sample_qq = sample_qq)
m2 <- rlm(sample_qq ~ normal_qq, data = qq_observed)
robust_intercept <- coef(m2)[1]
robust_slope <- coef(m2)[2]
# parametric bootstrap ribbon (boot_iter simulated Normal samples)
resample_qq <- numeric(boot_iter*n)
inc <- 1:n
for(iter in 1:boot_iter){
y <- rnorm(n, sd = sigma_m1_res)
y_res <- y - mean(y)
resample_qq[inc] <- y_res[order(y_res)]
inc <- inc + n
}
qq_sim <- data.table(normal_qq = normal_qq,
resample_qq = resample_qq)
qq_ci <- qq_sim[, .(median = median(resample_qq),
lower = quantile(resample_qq, 0.025),
upper = quantile(resample_qq, 0.975)),
by = normal_qq]
# quartile ribbon
qq_ci[, expected_quartile := quartile_intercept +
quartile_slope*normal_qq]
qq_ci[, quartile_se := quartile_slope*se_z(normal_qq, n)]
qq_ci[, quartile_lwr := expected_quartile - 1.96*quartile_se]
qq_ci[, quartile_upr := expected_quartile + 1.96*quartile_se]
# plot
gg <- ggplot(data = qq_observed,
aes(x = normal_qq, y = sample_qq))
if(add_boot_ci == TRUE){
gg <- gg +
geom_ribbon(data = qq_ci,
aes(ymin = lower,
ymax = upper,
y = median,
fill = "band"),
fill = pal_okabe_ito[5],
alpha = 0.4)
}
if(add_quartiles_ci == TRUE){
gg <- gg +
geom_ribbon(data = qq_ci,
aes(ymin = quartile_lwr,
ymax = quartile_upr,
y = expected_quartile,
fill = "band"),
fill = pal_okabe_ito[1],
alpha = 0.4)
}
gg <- gg +
# draw points
geom_point() +
# ggplot's qq line
geom_qq_line(aes(sample = sample_qq,
color = "geom_qq_line"),
show.legend=TRUE,
size = 0.75) +
# quartile
geom_abline(aes(intercept = quartile_intercept,
slope = quartile_slope,
color = "quartile"),
show.legend=TRUE,
size = 0.75,
linetype = "dashed") +
# robust
geom_abline(aes(intercept = robust_intercept,
slope = robust_slope,
color = "robust"),
show.legend = TRUE,
size = 0.75) +
# parametric
# geom_abline(aes(intercept = parametric_intercept,
# slope = parametric_slope,
# color = "parametric"),
# show.legend=TRUE,
# size = 0.75) +
scale_color_manual(values = pal_okabe_ito[c(6,1,5)]) +
theme_minimal_grid() +
NULL
if(plot_it==TRUE){
return(gg)
}else{
return(abs(robust_slope - quartile_slope))
}
}
n <- 50
# fake_y <- rnorm(n) # swap in to explore a Normal sample
fake_y <- rnegbin(n, mu = 10, theta = 1)
compare_qq(fake_y, add_boot_ci = TRUE, add_quartiles_ci = TRUE)
n_sim <- 1000
slope_diff <- numeric(n_sim)
for(sim_i in 1:n_sim){
set.seed(sim_i)
fake_y <- rnorm(n=20)
#fake_y <- rnegbin(n = 50, mu = 10, theta = 1)
slope_diff[sim_i] <- compare_qq(fake_y,
add_boot_ci = TRUE,
add_quartiles_ci = TRUE,
plot_it = FALSE) # return abs slope difference
}
max_i <- which(rank(slope_diff) > n_sim - 10) # indices of the 10 largest slope differences
# [1] 282 334 534 674 715 774 857 866 923 974
# use these to explore
set.seed(max_i[10])
fake_y <- rnorm(n=20)
compare_qq(fake_y,
add_boot_ci = TRUE,
add_quartiles_ci = TRUE)
# 1 - quartile interp: fat tails, totally within bootstrap
# 2 - same as #1
# 3 - quartiles: right skew. wicked off. robust: maybe right skew. close
# 4 - same as #1
# 5 - same as #1
# 6 - same as #1
# 7 - both okay
# 8 - same as #1
# 9 - same as #1
# 10 - both okay
This is a simulation to look at the coverage properties of the confidence band of the quartiles line and of the robust line. The CI band of the robust line is computed from the parametric bootstrap and really has nothing to do with the line itself, but the two are paired as defaults in car::qqPlot.
The frequency that is reported is the frequency of simulated data sets in which p or more points lie outside of the confidence band. This frequency is computed for three different values of p – \(0.1n\), \(0.2n\), and \(0.4n\), where \(n\) is the sample size. Coverage performance was simulated at three levels of \(n\) – 10, 20, and 40. Coverage was also simulated for both a normal distribution and a non-normal (negative binomial) distribution. For the non-normal distribution, the frequency of simulated data sets with points outside the band should be big – bigger is better. For the normal distribution, the frequency of simulated data sets with points outside the band should be small – smaller is better. It may seem like a frequency closer to 0.05 is better (since these are 95% intervals), but there are weird, correlated multiplicity effects going on.
The major result is,
- for data sampled from a Normal distribution, the parametric bootstrap band has a smaller frequency of false positives than the quartiles band, and
- for data sampled from the non-normal (negative binomial) distribution, the parametric bootstrap band flags non-normality more frequently than the quartiles band.
Interesting. A smaller false-positive error does not penalize the parametric bootstrap's sensitivity.
boot_qq_band <- function(n = 100,
sigma = 1,
n_iter = 1000){
resample_qq <- numeric(n_iter*n)
inc <- 1:n
for(iter in 1:n_iter){
y <- rnorm(n, sd = sigma)
y_res <- y - mean(y)
resample_qq[inc] <- y_res[order(y_res)]
inc <- inc + n
}
normal_qq <- ppoints(n) %>%
qnorm()
qq_sim <- data.table(normal_qq = normal_qq,
resample_qq = resample_qq)
qq_ci <- qq_sim[, .(median = median(resample_qq),
lower = quantile(resample_qq, 0.025),
upper = quantile(resample_qq, 0.975)),
by = normal_qq]
return(qq_ci)
}
qq_line_coverage <- function(n_sim = 1000,
n = 20,
normal = TRUE){
# n_sim <- 1000
# n = 20
# normal <- TRUE
sim_results <- matrix(nrow = n_sim, ncol = 4)
for(iter in 1:n_sim){
if(normal == TRUE){
fake_y <- rnorm(n)
}else{
fake_y <- rnegbin(n, mu = 10, theta = 1)
}
m1_res <- fake_y - mean(fake_y)
fake_sd <- sd(fake_y)
qq_table <- data.table(
sample_qq = m1_res[order(m1_res)],
boot_qq_band(n, sigma = fake_sd)
)
# quartile line
m1_quartiles <- quantile(m1_res, c(.25, .75))
m1_diff <- m1_quartiles[2] - m1_quartiles[1]
quartile_slope <- m1_diff/1.349
quartile_intercept <- median(m1_quartiles)
qq_table[, expected_quartile := quartile_intercept +
quartile_slope*normal_qq]
qq_table[, quartile_se := quartile_slope*se_z(normal_qq, n)]
qq_table[, quartile_lwr := expected_quartile - 1.96*quartile_se]
qq_table[, quartile_upr := expected_quartile + 1.96*quartile_se]
# robust line
m2 <- rlm(sample_qq ~ normal_qq, data = qq_table)
qq_table[, expected_robust := coef(m2)[1] +
coef(m2)[2]*normal_qq]
res <-
qq_table[, .(quartiles_pos = sum(sample_qq < quartile_lwr) +
sum(sample_qq > quartile_upr),
boot_pos = sum(sample_qq < lower) +
sum(sample_qq > upper),
quartiles_MAD = mean(abs(sample_qq - expected_quartile)/
(quartile_upr - quartile_lwr)),
boot_MAD = mean(abs(sample_qq - expected_robust)/
(upper - lower))
)]
sim_results[iter,] <- unlist(res[1])
}
colnames(sim_results) <- c("quartiles_pos",
"boot_pos",
"quartiles_MAD",
"boot_MAD")
sim_results <- data.table(sim_results)
return(sim_results)
}
have_i_simulated_it <- "yes"
out_path <- here(output_folder, "qq_sim.Rds")
if(have_i_simulated_it == "no"){
coverage_summary <- data.table(NULL)
n_list <- c(10, 20, 40)
model_list <- c(TRUE, FALSE) # TRUE = normal
sim_combi <- expand.grid(n = n_list,
model = model_list) %>%
data.table
for(i in 1:nrow(sim_combi)){
n_i <- sim_combi[i, n]
model_i <- sim_combi[i, model]
coverage_res <- qq_line_coverage(n = n_i, normal = model_i)
p_list <- round(c(n_i/10, n_i/5, n_i/2.5), 0)
for(p_i in p_list){
res <- coverage_res[,
.(freq_pos_quartiles = sum(quartiles_pos >= p_i)/n_sim,
freq_pos_boot = sum(boot_pos >= p_i )/n_sim)]
coverage_summary <- rbind(coverage_summary,
data.table(n = n_i,
normal = model_i,
p = p_i,
res))
}
}
saveRDS(coverage_summary, out_path)
}else{
coverage_summary <- readRDS(out_path)
}
knitr::kable(coverage_summary)
| n | normal | p | freq_pos_quartiles | freq_pos_boot |
|---:|:---|---:|---:|---:|
| 10 | TRUE | 1 | 0.296 | 0.173 |
| 10 | TRUE | 2 | 0.144 | 0.028 |
| 10 | TRUE | 4 | 0.018 | 0.001 |
| 20 | TRUE | 2 | 0.165 | 0.105 |
| 20 | TRUE | 4 | 0.047 | 0.019 |
| 20 | TRUE | 8 | 0.003 | 0.000 |
| 40 | TRUE | 4 | 0.108 | 0.097 |
| 40 | TRUE | 8 | 0.031 | 0.018 |
| 40 | TRUE | 16 | 0.000 | 0.000 |
| 10 | FALSE | 1 | 0.516 | 0.599 |
| 10 | FALSE | 2 | 0.204 | 0.302 |
| 10 | FALSE | 4 | 0.001 | 0.085 |
| 20 | FALSE | 2 | 0.477 | 0.824 |
| 20 | FALSE | 4 | 0.154 | 0.622 |
| 20 | FALSE | 8 | 0.002 | 0.248 |
| 40 | FALSE | 4 | 0.826 | 0.991 |
| 40 | FALSE | 8 | 0.357 | 0.950 |
| 40 | FALSE | 16 | 0.003 | 0.634 |