tags: factorial design, interaction, synergism, two-way anova, linear model

tl;dr – go right to the picture!

A factorial experiment is one in which there are two or more factor variables (categorical $$X$$) that are crossed, resulting in a group for each combination of the levels of each factor. Factorial experiments are used to estimate the interaction effect between factors. Two factors interact when the effect of one factor depends on the level of the other factors. Interactions are ubiquitous, although sometimes they are small enough to ignore with little to no loss of understanding.

# 1 Maize defense response

The motivating source article contains an experiment measuring the effect of two defense signals on the defense response in Maize plants. The source data from the experiment is an excellent resource for pumping inuition on interaction effects. In response to herbivory from insects, maize, and other plants, release multiple, chemical signals into the air (chemicals that evaporate into the air are known as volatile compounds). These chemicals signal the plant, and neighboring plants, to secrete anti-herbivory hormones, including abcisic acid and jasmonic acid. The researchers investigated the effects of two volatile compounts, (Z)‐3‐hexenyl acetate (HAC) and Indole, on the defense response both each without the other and in combination.

In a different post (How to measure synergism or antagonism, I use these data to show why the interaction effect is the measure of syergism that the authors of the motivating article want.

In this post, I use these data to explain why we should take the time to teach and learn the meaning of the coefficients of linear (“regression”) models fit to the data from classical experiments with categorical (treatment) variables.

Wait what? I thought regression models were for data with continuous $$X$$ variables and that ANOVA was for experimental data. This is the way statistics is often taught and practiced but analyzing experimental data with a linear (regression) model has many advantages over analyzing with ANOVA, some of which are explored here. Wait what? I heard that linear (regression) models and ANOVA are the same thing. No. A linear model is a function modeling the expected value of the response conditional on a set of $$X$$ variables. ANOVA is a comparison of different components of the variance of the response, where these components are determined by different levels of grouping of the $$X$$ variables. A linear model can be used to get the numbers to compute these variances but this isn’t necessary and is not how ANOVA was originally done (or taught in classical biostatistics books like Zar or Sokal and Rohlf).

knitr::opts_chunk\$set(echo = TRUE, message=FALSE, warning=FALSE)

library(here)
library(janitor)
library(data.table)

# graphics and tables
library(ggplot2)
library(ggforce) # jitter
library(ggsci) # color palettes
library(ggpubfigs) # color palettes
library(ggthemes) # themes and palettes including colorblind
library(cowplot) # combine plots
library(lazyWeave)
library(kableExtra)

# analysis packages
library(emmeans)
library(lmPerm)

here <- here::here
data_folder <- "content/data"

# Okabe & Ito palette
pal_ito_seven <- friendly_pal("ito_seven") # ggpubfigs
pal_okabe_ito <- colorblind_pal()(8)[2:8] # ggthemes

The example data come from Figure 1a, which is the effect of HAC and Indole on tissue concentrations of the hormone abcisic acid (ABA). The design is fully crossed with two factors, each with two levels: hac, with levels “HAC-” and “HAC+”, and indole, with levels (“Indole-” and “Indole+”).

df <- data.frame("Indole-" = c("Control", "HAC"),
"Indole+" = c("Indole", "HAC+Indole"))
row.names(df) <- c("HAC-", "HAC+")
knitr::kable(df, col.names = c("Indole-", "Indole+"),
caption = "2 x 2 table of treatment combinations") %>%
kable_styling(bootstrap_options = "striped",
full_width = FALSE,
position = "center") %>%
column_spec(column = 1,
bold = c(T, T))
Table 1: 2 x 2 table of treatment combinations
Indole- Indole+
HAC- Control Indole
HAC+ HAC HAC+Indole

I see many researchers in bench biology analyzing the flattened structure, as if the four groups do not have the underlying $$2 \times 2$$ design.

Table 2: flattened table of treatment combinations
Group 1 Group 2 Group 3 Group 4
Control HAC Indole HAC+Indole

This isn’t wrong – all inferences (for example, estimates of effects among pairs of groups) are the same, but this analysis doesn’t estimate the interaction effect, and the interaction is the raison d’être of a factorial design.

The prevalence of flattened analyes reflects how we teach statistics, using t-tests, ANOVA, and chi-squared tests to get a p-value, instead of teaching a linear model to get estimates of effects and the uncertainty in the estimate. This lack of understanding is especially conspicuous in studies, like the source article, that fail to report the interaction effect, when it is the interaction that is the exact statistic the researchers want. Yes, a 2-way ANOVA with interaction has a p-value for the interaction effect, but the ANOVA fails to communicate what the interaction is.

# change the contrasts coding for the model matrix
con3 <- list(hac=contr.sum, indole=contr.sum)
m2 <- lm(aba ~ hac*indole, data=fig1, contrasts=con3)
car::Anova(m2, type = 3)
## Anova Table (Type III tests)
##
## Response: aba
##              Sum Sq Df  F value    Pr(>F)
## (Intercept) 22313.2  1 428.5814 5.617e-13 ***
## hac          1941.7  1  37.2951 1.514e-05 ***
## indole       1152.9  1  22.1453  0.000238 ***
## hac:indole    205.4  1   3.9448  0.064422 .
## Residuals     833.0 16
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The researchers in the source article wanted to measure synergism – a positive, interaction – but we can’t understand what synergism is from an ANOVA table. I’ll add that we can’t understand what synergism is from the coefficients of the linear model used to compute the ANOVA table (this is a linear model with effects coding for the indicator variables), because the interaction effect is spread into little bits.

But we can understand synergism, or any interaction, from the coefficients of a linear model using dummy coded indicator variables, which is the default coding in R. We should be teaching what these coefficients are.

# 2 Interaction effects are estimated with a factorial linear model

# model fit
m1 <- lm(aba ~ hac * indole, data = fig1)

A factorial design allows a researcher to estimate the interaction between two factors. An interaction is an effect between two variables in a linear model. The verbal model fit to the data in the chunk above is

$$$aba = hac + indole + hac:indole$$$

where hac:indole is the interaction term. The linear model is

\begin{align} aba = \beta_0 + \beta_1 hac_{HAC^+} &+ \beta_2 indole_{Indole^+} + \beta_3 hac_{HAC^+}:indole_{Indole^+} +\varepsilon \end{align}

1. $$hac_{HAC^+}$$ is a dummy-coded indicator variable, with a 1 assigned to plants exposed to HAC and 0 assigned to plants not exposed to HAC.
• The coefficient of $$hac_{HAC^+}$$ is $$\beta_1$$, which is the effect of adding HAC.
1. $$indole_{Indole^+}$$ is a dummy-coded indicator variable, with a 1 assigned to plants exposed to Indole and 0 assigned to plants not exposed to Indole.
• The coefficient of $$indole_{Indole^+}$$ is $$\beta_2$$, which is the effect of adding Indole.
1. $$hac_{HAC^+} : indole_{Indole^+}$$ is the interaction variable. It is the product of the two indicator variables in the interaction ($$hac_{HAC^+}$$ and $$indole_{Indole^+}$$).
• The coefficient of $$hac_{HAC^+} : indole_{Indole^+}$$ is $$\beta_3$$, which is the interaction effect. This is the effect of the combined HAC + Indole above (or below) that expected from the sum of their individual effects.

# 3 The meaning of the “Estimates” in the coefficient table

Table 3: Coefficient table
Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 19.2 3.23 5.94 0.000 12.3 26.0
hacHAC+ 13.3 4.56 2.91 0.010 3.6 23.0
indoleIndole+ 8.8 4.56 1.92 0.072 -0.9 18.5
hacHAC+:indoleIndole+ 12.8 6.45 1.99 0.064 -0.9 26.5

The values in the column Estimate are the estimates of the $$\beta$$ parameters in the linear model above. These estimates are the values of the $$b$$ coefficients of the fit model

$$$aba = b_0 + b_1 hac_{HAC^+} + b_2 indole_{Indole^+} + b_3 hac_{HAC^+}:indole_{Indole^+} + e$$$

1. The Intercept ($$b_0$$) is the mean of the reference (HAC-/Indole-) group, and so the mean of the upper left cell (“Control”) in Table 3.
2. The hacHAC+ coefficient ($$b_1$$) is the estimate of the added HAC effect relative to the reference, and so is the mean of the lower left cell (“HAC”) minus the mean of the upper left cell (“Control”) in Table 3. Another way of stating this is, it is the effect of HAC when Indole is at its reference level.
3. The indoleIndole+ coefficient ($$b_2$$) is the estimate of the added Indole effect relative to the reference, and so is the mean of the upper right cell (“Indole”) minus the mean of the upper left cell (“Control”) in Table 3. Another way of stating this is, it is the effect of Indole when HAC is at its reference level.
4. The hacHAC+:indoleIndole+ coefficient ($$b_3$$) is the estimate of the interaction effect. If we added the HAC effect ($$b_1$$) and the Indole effect ($$b_2$$) to the Control mean ($$b_0$$), we would get the expected mean of the “HAC+Indole” group if the effects were additive. The hacHAC+:indoleIndole+ effect ($$b_3$$) is the additional bit that we need to get from this additive expectation to the modeled HAC+Indole mean (Figure ??). It is the difference between the mean of the bottom right cell (“HAC+Indole”) and the sum of the coefficients of the other three cells ($$b0$$, $$b1$$, $$b2$$) in Table 3.

# 4 What an interaction is

An interaction is a non-additive effect. This is true for any coding of the contrasts for the linear model. But an interaction is most intuitive using dummy coding, especially for an experiment like that in the motivating article.

Here is how to think about what an interaction using dummy coding is Adding HAC alone increases ABA concentration by 13.3 ng per g FW. Adding Indole alone increases ABA concentration by 8.8 ng per g FW. If these effects were purely additive, then adding both HAC and Indole to the Control mean should result in a mean of 19.2 + 13.3 + 8.8 = 41.3 ng per g FW in the HAC+Indole group. The modeled mean is 54.1 ng per g FW. The difference observed - additive is 54.1 - 41.3 = 12.8 ng per g FW. Compare this to the interaction coefficient in the coefficient table.

# 5 The biological interpretation of an interaction effect

The biological reasons causing interaction effects are highly variable but lets consider how interactions might arise in the context of the ABA defense response to HAC and Indole. Additive effects (no interaction) may occur when combined treatments act independently of each other. This might occur in the Maize ABA response if the signaling path from HAC reception to ABA secretion and Indole reception to ABA secretion occur in different cells or by different signaling pathways and activity in either pathway has no influence on the other. Positive, or synergistic interaction effects may occur when combined treatments augment each other’s ability to affect the response. This could occur in the Maize ABA response if an active signaling path from one of the volatile compound to ABA secretion makes the signaling path from the other compound to ABA secretion more sensitive to that compound. Negative, or antagonistic interaction effects may occur when combined treatments interfere with each other’s ability to affect the response. This could occur in the Maize ABA response if an active signaling path from one of the volatile compound to ABA secretion inhibits the signaling path from the other compound to ABA secretion. Negative interaction effects could also occur if there is a simple threshold response and all it takes is either the HAC or the Indole signaling pathway to activate the response.