Copyright © Jeffrey A. Walker
keywords: q-q plot, model checking
knitr::opts_chunk$set(echo = TRUE,
message=FALSE,
warning = FALSE)
# import and wrangling
library(here)
library(janitor)
library(readxl)
library(data.table)
# analysis
library(car) # qqPlot
library(qqplotr) # qq plot functions for ggplot
library(MASS) # rnegbin
# graphics and output
library(ggpubr) # qq plot functions
library(ggpubfigs) # palette
library(ggthemes) # themes and palettes including colorblind
library(ggforce)
library(cowplot)
here <- here::here
data_folder <- "content/data"
output_folder <- "content/output"
# Okabe & Ito palette
ito_seven <- friendly_pal("ito_seven") # ggpubfigs
pal_okabe_ito <- colorblind_pal()(8)[2:8] # ggthemes
Warning - This is a long, exploratory post on Q-Q plots motivated by the specific data set analyzed below, and the code follows my stream of thinking this through. I have not gone back through to economize on length. So yeah, some repeated code I've turned into functions and other repeated code is simply repeated.
This post is not about how to interpret a Q-Q plot but about which Q-Q plot to interpret.
huh?
- base R, ggpubr, and ggplot2 compute the Q-Q line using the "quartiles" method – the line passes through the 0.25 and 0.75 quantile pairs, so it is a function of only 2 of the n points.
- qqPlot from the car package computes a "robust" Q-Q line using a robust regression through all of the points. At least, the default qqPlot on an lm object does this. If a vector of responses or residuals is passed in, qqPlot defaults to the quartiles line. Regardless, a user can specify either.
- ggpubr and ggplot2 (and base R?) compute a confidence band around the quartiles line using a parametric standard error.
- qqPlot uses a parametric bootstrap to compute the confidence band if an lm object is passed to the function. If a vector is passed, qqPlot computes the band from a parametric SE. Again, a user can specify either.
To see this with qqPlot:
set.seed(1)
qqPlot(rnorm(20), id = FALSE) # quartiles
set.seed(1)
qqPlot(lm(rnorm(20) ~ 1), id = FALSE) # robust regression
The question pursued here is, which should a user use? The short answer is, the robust line with the parametric bootstrap confidence band from qqPlot.
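For reference, here is a minimal sketch of that recommendation (my own aside, not part of the analysis below; it assumes any fitted lm object, here a throwaway intercept-only fit). These arguments are the documented defaults for qqPlot's lm method, written out explicitly:
# robust Q-Q line + parametric bootstrap envelope from car::qqPlot
set.seed(1)
fit <- lm(rnorm(20) ~ 1)  # any fitted lm object
qqPlot(fit,
       line = "robust",   # robust-regression Q-Q line
       simulate = TRUE,   # parametric bootstrap confidence envelope
       id = FALSE)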
data_from <- "Loss of p53 triggers WNT-dependent systemic inflammation to drive breast cancer metastasis"
file_name <- "41586_2019_1450_MOESM3_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)
treatment_levels <- c("Trp53+/+", "Trp53-/-")
fig1f <- read_excel(file_path,
sheet = "Fig. 1f",
range = "A2:B23") %>%
data.table() %>%
melt(measure.vars = treatment_levels,
variable.name = "treatment",
value.name = "il1beta")
fig1f[, treatment := factor(treatment, treatment_levels)]
# be careful with missing data: it can create a mismatch between row id and residual unless na.action is specified in lm (see the aside below)
# head(fig1f)
m1 <- lm(il1beta ~ treatment, data = fig1f)
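# (aside, not the author's code) if il1beta contained NAs, adding
# na.action = na.exclude would keep residuals() padded and aligned with fig1f's rows:
# m1 <- lm(il1beta ~ treatment, data = fig1f, na.action = na.exclude)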
# residuals
m1_res <- residuals(m1)
# studentized residuals
sigma_m1_res <- sd(m1_res) # will use this again
m1_res_st <- m1_res/sigma_m1_res
n <- length(m1_res)
# base R plot
qqnorm(m1_res)
qqline(m1_res)
# ggpubr plot
ggqqplot(data = data.table(m1_res = m1_res),
x = "m1_res")
## car qqPlot
# car plot
qqPlot(m1,
simulate = FALSE,
id = FALSE)
Q-Q plots generated by base R, ggpubr::ggqqplot, and car::qqPlot are shown. The base R and ggqqplot functions generate the same line. ggplot2::geom_qq_line also generates this line (not shown, but try it!). The default qqPlot line differs from these three. This default is line = "robust". If this argument is changed to line = "quartiles", the qqPlot line is the same as the base R/ggqqplot line. The difference between "robust" and base R/ggqqplot/"quartiles" is pretty easy to see with these data. Look at points 2-4 starting from the left. In the base R/ggqqplot plots, point 2 is slightly above the line, point 3 is slightly below the line, and point 4 is above the line. In the qqPlot, all three points are above the line. Even more striking is the confidence interval. In the ggqqplot, the upper seven (rightmost) points are well outside the interval – we would conclude that a sample from a Normal distribution would be unlikely to generate this pattern of quantiles, and we should consider modeling with a glm or a transformation, etc. In the qqPlot, these seven points are within or just outside the interval – we would conclude that a sample from a Normal would produce something like these quantiles often enough not to worry about modeling with a glm or a transformation or whatever.
Two different interpretations, two different "decisions". Which decision should we decide on? I'll return to that. But first, some sleuthing.
Being curious about the parameters of these lines, I went to the source – the help page for qqPlot. The text for the argument line is:
“quartiles” to pass a line through the quartile-pairs, or “robust” for a robust-regression line; the latter uses the rlm function in the MASS package.
Okay, so let's make three lines:
- parametric – the slope is the standard deviation of the residuals and the intercept is their mean.
- quartiles – the line passes through the 0.25 and 0.75 quantile pairs (the base R/ggqqplot line).
- robust – a robust regression of the sample quantiles on the theoretical quantiles using MASS::rlm. This function has several arguments; I used the defaults.
In addition to these, I used a parametric bootstrap to create my own 95% confidence band of the sample quantiles. For this, I sampled \(n\) points from a Normal distribution with the same standard deviation as the residuals and computed the sample quantiles. I repeated this 1000 times to generate the distribution of expected sample quantiles at each of the \(n\) theoretical quantiles. The 95% CI was computed as the 2.5th and 97.5th percentiles of each of these distributions.
normal_qq <- ppoints(n) %>%
qnorm()
sample_qq <- m1_res[order(m1_res)]
# mean + sd
parametric_slope <- sd(sample_qq)
parametric_intercept <- mean(sample_qq)
# quartiles
m1_quartiles <- quantile(m1_res, c(.25, .75))
qnorm_quartiles <- qnorm( c(.25, .75))
m1_diff <- m1_quartiles[2] - m1_quartiles[1]
qnorm_diff <- qnorm_quartiles[2] - qnorm_quartiles[1] # = 1.349
quartile_slope <- m1_diff/qnorm_diff
quartile_intercept <- median(m1_quartiles) # midpoint of the two sample quartiles
# robust uses MASS::rlm (default arguments?)
qq_observed <- data.table(normal_qq = normal_qq,
sample_qq = sample_qq)
m2 <- rlm(sample_qq ~ normal_qq, data = qq_observed)
robust_intercept <- coef(m2)[1]
robust_slope <- coef(m2)[2]
# parametric bootstrap ribbon: simulate n Normal residuals with the residual SD, center, sort; repeat n_iter times
n_iter <- 1000
resample_qq <- numeric(n_iter*n)
inc <- 1:n
for(iter in 1:n_iter){
y <- rnorm(n, sd = sigma_m1_res)
y_res <- y - mean(y)
resample_qq[inc] <- y_res[order(y_res)]
inc <- inc + n
}
qq_sim <- data.table(normal_qq = normal_qq,
resample_qq = resample_qq)
qq_ci <- qq_sim[, .(median = median(resample_qq),
lower = quantile(resample_qq, 0.025),
upper = quantile(resample_qq, 0.975)),
by = normal_qq]
ggplot(data = qq_observed,
aes(x = normal_qq, y = sample_qq)) +
# ribbon
geom_ribbon(data = qq_ci,
aes(ymin = lower,
ymax = upper,
y = median,
fill = "band"),
fill = "gray",
alpha = 0.6) +
# draw points
geom_point() +
# ggplot's qq line
geom_qq_line(aes(sample = sample_qq,
color = "geom_qq_line"),
show.legend=TRUE,
size = 0.75) +
# parametric
geom_abline(aes(intercept = parametric_intercept,
slope = parametric_slope,
color = "parametric"),
show.legend=TRUE,
size = 0.75) +
# quartile
geom_abline(aes(intercept = quartile_intercept,
slope = quartile_slope,
color = "quartile"),
show.legend=TRUE,
size = 0.75,
linetype = "dashed") +
# robust
geom_abline(aes(intercept = robust_intercept,
slope = robust_slope,
color = "robust"),
show.legend = TRUE,
size = 0.75) +
scale_color_manual(values = pal_okabe_ito[c(1:2,5:6)]) +
theme_minimal_grid() +
NULL
Huh! The quartiles line, which is used by base R, ggpubr, and ggplot2, falls well outside the upper half of the 95% confidence band of sample quantiles sampled from a normal distribution. The parametric and robust lines slice more through the middle of the band, with the parametric nearly bisecting the band.
In sleuthing out the parameters defining the lines in the different Q-Q plot functions, in addition to the qqPlot help page, I discovered Tristan Mahr's excellent blog post Q-Q Plots and Worm Plots from Scratch.
What confused me about Tristan's results was the equivalence between the car::qqPlot line and a "robust" line that Tristan built by setting the robust slope to the interquartile range divided by 1.349. This raised one minor and one major question.
Minor: where does the 1.349 come from? I discovered this somewhat accidentally by looking at the intermediate values in my computation. The 1.349 comes from the denominator of the slope, computed from the theoretical quartiles:
qnorm_quartiles <- qnorm( c(.25, .75))
(qnorm_diff <- qnorm_quartiles[2] - qnorm_quartiles[1])
## [1] 1.34898
Major: Tristan's "robust" computation is my "quartiles" computation, which I got from the help page of car::qqPlot. Tristan calls the quartiles line "robust" because of this quote from John Fox, the author of qqPlot and the book Applied Regression Analysis and Generalized Linear Models:
We can alternatively use the median as a robust estimator of [the mean] and the interquartile range / 1.349 as a robust estimator of [the standard deviation]. (The more conventional estimates [of the sample mean and SD] will not work well when the data are substantially non-normal.) [p. 39]
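As a quick numerical aside (mine, not Fox's or Tristan's): for Normal data the interquartile range is about 1.349 standard deviations, so IQR/1.349 recovers the SD.
# IQR/1.349 approximates the SD for a large Normal sample
set.seed(1)
x <- rnorm(1e5)
c(sd = sd(x), iqr_over_1.349 = IQR(x) / 1.349)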
In my example, the qqPlot default is equivalent to my line computed using robust regression (MASS::rlm
), which I got from the qqPlot help page. In Tristan’s code, the qqPlot default is equivalent to his line computed using the sample quartiles. How could these both be TRUE?
The answer took a bit more sleuthing. In my qqPlot code, I used the S3 method for class 'lm' and passed the fitted linear model from lm. Tristan used the default S3 method and passed the vector of samples. Curiously, the default line in qqPlot differs between these two methods, and this is in the help page. Compare:
qqPlot(m1, # uses line = "robust" by default
simulate = FALSE,
id = FALSE)
qqPlot(m1_res, # uses line = "quartiles" by default
id = FALSE)
Regardless, the line computed in Tristan’s code is not the “robust” line but the “quartiles” line, despite Fox using the word “robust” in the explanation of the quartiles line.
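To convince myself, here is a quick check (my own aside, using the m1_res residuals computed above): the slope of Tristan's "robust" line, the interquartile range divided by 1.349, equals the slope of the "quartiles" line, the rise over run through the 0.25 and 0.75 quantile pairs, up to the rounding of 1.349.
# both slopes come from the same pair of quantiles
iqr_slope      <- diff(quantile(m1_res, c(0.25, 0.75))) / 1.349
quartile_slope <- diff(quantile(m1_res, c(0.25, 0.75))) /
  diff(qnorm(c(0.25, 0.75)))
c(iqr_slope, quartile_slope) # equal up to the rounding of 1.349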
# https://www.tjmahr.com/quantile-quantile-plots-from-scratch/
se_z <- function(z, n) {
sqrt(pnorm(z) * (1 - pnorm(z)) / n) / dnorm(z)
}
compare_qq <- function(fake_y,
add_boot_ci = TRUE,
add_quartiles_ci = FALSE,
studentized = FALSE,
boot_iter = 1000,
plot_it = TRUE){
n <- length(fake_y)
m1 <- lm(fake_y ~ 1)
m1_res <- residuals(m1)
if(studentized == TRUE){m1_res <- m1_res/sd(m1_res)}
sigma_m1_res <- sd(m1_res)
normal_qq <- ppoints(n) %>%
qnorm()
sample_qq <- m1_res[order(m1_res)]
# mean + sd
parametric_slope <- sd(sample_qq)
parametric_intercept <- mean(sample_qq)
# quartiles
m1_quartiles <- quantile(m1_res, c(.25, .75))
qnorm_quartiles <- qnorm( c(.25, .75))
m1_diff <- m1_quartiles[2] - m1_quartiles[1]
qnorm_diff <- qnorm_quartiles[2] - qnorm_quartiles[1] # = 1.349
quartile_slope <- m1_diff/qnorm_diff
quartile_intercept <- median(m1_quartiles)
# robust uses MASS::rlm (default arguments?)
qq_observed <- data.table(normal_qq = normal_qq,
sample_qq = sample_qq)
m2 <- rlm(sample_qq ~ normal_qq, data = qq_observed)
robust_intercept <- coef(m2)[1]
robust_slope <- coef(m2)[2]
# parametric bootstrap ribbon (boot_iter simulated Normal samples)
resample_qq <- numeric(boot_iter*n)
inc <- 1:n
for(iter in 1:boot_iter){
y <- rnorm(n, sd = sigma_m1_res)
y_res <- y - mean(y)
resample_qq[inc] <- y_res[order(y_res)]
inc <- inc + n
}
qq_sim <- data.table(normal_qq = normal_qq,
resample_qq = resample_qq)
qq_ci <- qq_sim[, .(median = median(resample_qq),
lower = quantile(resample_qq, 0.025),
upper = quantile(resample_qq, 0.975)),
by = normal_qq]
# quartile ribbon
qq_ci[, expected_quartile := quartile_intercept +
quartile_slope*normal_qq]
qq_ci[, quartile_se := quartile_slope*se_z(normal_qq, n)]
qq_ci[, quartile_lwr := expected_quartile - 1.96*quartile_se]
qq_ci[, quartile_upr := expected_quartile + 1.96*quartile_se]
# plot
gg <- ggplot(data = qq_observed,
aes(x = normal_qq, y = sample_qq))
if(add_boot_ci == TRUE){
gg <- gg +
geom_ribbon(data = qq_ci,
aes(ymin = lower,
ymax = upper,
y = median,
fill = "band"),
fill = pal_okabe_ito[5],
alpha = 0.4)
}
if(add_quartiles_ci == TRUE){
gg <- gg +
geom_ribbon(data = qq_ci,
aes(ymin = quartile_lwr,
ymax = quartile_upr,
y = expected_quartile,
fill = "band"),
fill = pal_okabe_ito[1],
alpha = 0.4)
}
gg <- gg +
# draw points
geom_point() +
# ggplot's qq line
geom_qq_line(aes(sample = sample_qq,
color = "geom_qq_line"),
show.legend=TRUE,
size = 0.75) +
# quartile
geom_abline(aes(intercept = quartile_intercept,
slope = quartile_slope,
color = "quartile"),
show.legend=TRUE,
size = 0.75,
linetype = "dashed") +
# robust
geom_abline(aes(intercept = robust_intercept,
slope = robust_slope,
color = "robust"),
show.legend = TRUE,
size = 0.75) +
# parametric
# geom_abline(aes(intercept = parametric_intercept,
# slope = parametric_slope,
# color = "parametric"),
# show.legend=TRUE,
# size = 0.75) +
scale_color_manual(values = pal_okabe_ito[c(6,1,5)]) +
theme_minimal_grid() +
NULL
if(plot_it==TRUE){
return(gg)
}else{
return(abs(robust_slope - quartile_slope))
}
}
n <- 50
# fake_y <- rnorm(n) # swap in to explore a Normal sample
fake_y <- rnegbin(n, mu = 10, theta = 1)
compare_qq(fake_y, add_boot_ci = TRUE, add_quartiles_ci = TRUE)
n_sim <- 1000
slope_diff <- numeric(n_sim)
for(sim_i in 1:n_sim){
set.seed(sim_i)
fake_y <- rnorm(n=20)
#fake_y <- rnegbin(n = 50, mu = 10, theta = 1)
slope_diff[sim_i] <- compare_qq(fake_y,
add_boot_ci = TRUE,
add_quartiles_ci = TRUE,
plot_it = FALSE) # return abs slope difference
}
max_i <- which(rank(slope_diff) > n_sim - 10) # indices of the 10 largest slope differences
# [1] 282 334 534 674 715 774 857 866 923 974
# use these to explore
set.seed(max_i[10])
fake_y <- rnorm(n=20)
compare_qq(fake_y,
add_boot_ci = TRUE,
add_quartiles_ci = TRUE)
# 1 - quartile interp: fat tails, totally within bootstrap
# 2 - same as #1
# 3 - quartiles: right skew. wicked off. robust: maybe right skew. close
# 4 - same as #1
# 5 - same as #1
# 6 - same as #1
# 7 - both okay
# 8 - same as #1
# 9 - same as #1
# 10 - both okay
This is a simulation to look at the coverage properties of the confidence band of the quartiles line and of the robust line. The CI band of the robust line is computed from the parametric bootstrap and really has nothing to do with the line itself, but the two are paired as defaults in car::qqPlot.
The frequency that is reported is the frequency of simulated data sets in which p or more points lie outside of the confidence band. This frequency is computed for three different values of p – \(0.1n\), \(0.2n\), and \(0.4n\), where \(n\) is the sample size. Coverage performance was simulated at three levels of \(n\) – 10, 20, and 40. Coverage was also simulated for both a normal distribution and a non-normal (negative binomial) distribution. For the non-normal distribution, the frequency of simulated data sets with points outside the band should be big – bigger is better. For the normal distribution, the frequency of simulated data sets with points outside the band should be small – smaller is better. It may seem like a frequency closer to 0.05 is better (since these are 95% intervals), but there are weird, correlated multiplicity effects going on.
The major result is,
- for data sampled from a Normal distribution, the parametric bootstrap band has a smaller frequency of false positives than the quartiles band, and
- for data sampled from the non-normal (negative binomial) distribution, the parametric bootstrap band flags non-normality more frequently than the quartiles band.
Interesting. A smaller false-positive error does not penalize the parametric bootstrap's sensitivity.
boot_qq_band <- function(n = 100,
sigma = 1,
n_iter = 1000){
resample_qq <- numeric(n_iter*n)
inc <- 1:n
for(iter in 1:n_iter){
y <- rnorm(n, sd = sigma)
y_res <- y - mean(y)
resample_qq[inc] <- y_res[order(y_res)]
inc <- inc + n
}
normal_qq <- ppoints(n) %>%
qnorm()
qq_sim <- data.table(normal_qq = normal_qq,
resample_qq = resample_qq)
qq_ci <- qq_sim[, .(median = median(resample_qq),
lower = quantile(resample_qq, 0.025),
upper = quantile(resample_qq, 0.975)),
by = normal_qq]
return(qq_ci)
}
qq_line_coverage <- function(n_sim = 1000,
n = 20,
normal = TRUE){
# n_sim <- 1000
# n = 20
# normal <- TRUE
sim_results <- matrix(nrow = n_sim, ncol = 4)
for(iter in 1:n_sim){
if(normal == TRUE){
fake_y <- rnorm(n)
}else{
fake_y <- rnegbin(n, mu = 10, theta = 1)
}
m1_res <- fake_y - mean(fake_y)
fake_sd <- sd(fake_y)
qq_table <- data.table(
sample_qq = m1_res[order(m1_res)],
boot_qq_band(n, sigma = fake_sd)
)
# quartile line
m1_quartiles <- quantile(m1_res, c(.25, .75))
m1_diff <- m1_quartiles[2] - m1_quartiles[1]
quartile_slope <- m1_diff/1.349
quartile_intercept <- median(m1_quartiles)
qq_table[, expected_quartile := quartile_intercept +
quartile_slope*normal_qq]
qq_table[, quartile_se := quartile_slope*se_z(normal_qq, n)]
qq_table[, quartile_lwr := expected_quartile - 1.96*quartile_se]
qq_table[, quartile_upr := expected_quartile + 1.96*quartile_se]
# robust line
m2 <- rlm(sample_qq ~ normal_qq, data = qq_table)
qq_table[, expected_robust := coef(m2)[1] +
coef(m2)[2]*normal_qq]
res <-
qq_table[, .(quartiles_pos = sum(sample_qq < quartile_lwr) +
sum(sample_qq > quartile_upr),
boot_pos = sum(sample_qq < lower) +
sum(sample_qq > upper),
quartiles_MAD = mean(abs(sample_qq - expected_quartile)/
(quartile_upr - quartile_lwr)),
boot_MAD = mean(abs(sample_qq - expected_robust)/
(upper - lower))
)]
sim_results[iter,] <- unlist(res[1])
}
colnames(sim_results) <- c("quartiles_pos",
"boot_pos",
"quartiles_MAD",
"boot_MAD")
sim_results <- data.table(sim_results)
return(sim_results)
}
have_i_simulated_it <- "yes"
out_path <- here(output_folder, "qq_sim.Rds")
if(have_i_simulated_it == "no"){
coverage_summary <- data.table(NULL)
n_list <- c(10, 20, 40)
model_list <- c(TRUE, FALSE) # TRUE = normal
sim_combi <- expand.grid(n = n_list,
model = model_list) %>%
data.table
for(i in 1:nrow(sim_combi)){
n_i <- sim_combi[i, n]
model_i <- sim_combi[i, model]
coverage_res <- qq_line_coverage(n = n_i, normal = model_i)
p_list <- round(c(n_i/10, n_i/5, n_i/2.5), 0)
for(p_i in p_list){
res <- coverage_res[,
.(freq_pos_quartiles = sum(quartiles_pos >= p_i)/n_sim,
freq_pos_boot = sum(boot_pos >= p_i )/n_sim)]
coverage_summary <- rbind(coverage_summary,
data.table(n = n_i,
normal = model_i,
p = p_i,
res))
}
}
saveRDS(coverage_summary, out_path)
}else{
coverage_summary <- readRDS(out_path)
}
knitr::kable(coverage_summary)
| n | normal | p | freq_pos_quartiles | freq_pos_boot |
|---:|:---|---:|---:|---:|
| 10 | TRUE | 1 | 0.296 | 0.173 |
| 10 | TRUE | 2 | 0.144 | 0.028 |
| 10 | TRUE | 4 | 0.018 | 0.001 |
| 20 | TRUE | 2 | 0.165 | 0.105 |
| 20 | TRUE | 4 | 0.047 | 0.019 |
| 20 | TRUE | 8 | 0.003 | 0.000 |
| 40 | TRUE | 4 | 0.108 | 0.097 |
| 40 | TRUE | 8 | 0.031 | 0.018 |
| 40 | TRUE | 16 | 0.000 | 0.000 |
| 10 | FALSE | 1 | 0.516 | 0.599 |
| 10 | FALSE | 2 | 0.204 | 0.302 |
| 10 | FALSE | 4 | 0.001 | 0.085 |
| 20 | FALSE | 2 | 0.477 | 0.824 |
| 20 | FALSE | 4 | 0.154 | 0.622 |
| 20 | FALSE | 8 | 0.002 | 0.248 |
| 40 | FALSE | 4 | 0.826 | 0.991 |
| 40 | FALSE | 8 | 0.357 | 0.950 |
| 40 | FALSE | 16 | 0.003 | 0.634 |