ANOVA

The ANOVA model and sum-of-squares decomposition

Kris Sankaran
09-21-2021

Readings 3.1 - 3.3, Rmarkdown

  1. ANOVA is used when we want to compare the effect of different treatments on a continuous response. For example,
    • How does the etch rate of a tool depend on its power setting?
    • How do an opera company’s different donation strategies compare with one another?
    • How does the average rental time compare across cars?
    It extends two-sample testing to settings where a single factor can take more than two levels.

Model and Test Setup

  2. Formally, consider the model, \[ y_{ij} = \mu + \tau_i + \epsilon_{ij} \] where \(i = 1, \dots, a\) and \(j = 1, \dots, n\), and the errors \(\epsilon_{ij} \sim \mathcal{N}\left(0, \sigma^2\right)\) are independent.
    • \(i\) indexes different groups
    • \(j\) indexes the samples within groups
    • \(a\) is the number of groups
    • \(n\) is the number of samples in each group
    • \(N=na\) is the total number of samples

Figure 1: The underlying distributions in the ANOVA model. Under the null, all the distributions have the same vertical offset.
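To make the model concrete, here is a minimal simulation sketch; the values of mu, tau, and sigma below are made up for illustration and are not estimated from any real dataset.

# Simulate y_ij = mu + tau_i + eps_ij (all parameter values are made up)
set.seed(123)
a <- 4                      # number of groups
n <- 5                      # samples per group
mu <- 600                   # global mean
tau <- c(-40, -10, 15, 35)  # group offsets; all zero under the null
sigma <- 20                 # error standard deviation
sim <- data.frame(
  group = factor(rep(1:a, each = n)),
  y = mu + rep(tau, each = n) + rnorm(a * n, 0, sigma)
)
head(sim)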

  3. Our null hypothesis is that none of the groups deviate from the global mean. The alternative is that at least one of the groups is different. Formally, \[ H_0: \tau_1 = \dots = \tau_{a} = 0 \\ H_1: \tau_{i} \neq 0 \text{ for at least one } i. \]

Important Identities

  4. The word “analysis” in ANOVA is used in the classical sense: to break something into its parts1. ANOVA breaks the observed variation into distinct components, \[ \sum_{i,j} \color{#ff8200}{\left(y_{ij} - \bar{y}\right)}^2 = n\sum_{i} \color{#447583}{\left(\bar{y}_i - \bar{y}\right)}^2 + \sum_{i,j} \color{#b090c2}{\left(y_{ij} - \bar{y}_{i}\right)}^2 \] which is abbreviated as

    \[ \color{#ff8200}{SS_{\text{total}}} = \color{#447583}{SS_{\text{treatments}}} + \color{#b090c2}{SS_{\text{error}}}. \] This is usually called the “ANOVA identity.”


Figure 2: Visual representation of the three terms in the ANOVA identity.
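Using the simulated data from the sketch above, we can check the identity numerically:

# Check SS_total = SS_treatments + SS_error on the simulated data
y_bar <- mean(sim$y)                       # global mean
y_bar_i <- tapply(sim$y, sim$group, mean)  # per-group means
ss_total <- sum((sim$y - y_bar)^2)
ss_treat <- n * sum((y_bar_i - y_bar)^2)
ss_error <- sum((sim$y - y_bar_i[as.character(sim$group)])^2)
c(ss_total, ss_treat + ss_error)  # the two sides agree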

  5. If any of the groups had a mean different from the global mean, then we’d expect the \(\color{#447583}{SS_{\text{treatments}}}\) term to be larger than it would otherwise be. How large is large enough to reject?

  6. Since the variance within each group is \(\sigma^2\), the variance of each \(\bar{y}_i\) is \(\frac{\sigma^2}{n}\). The blue term looks like how we would usually estimate the variance of the \(\bar{y}_i\), i.e.,

\[ \frac{1}{a - 1}\sum_{i}\color{#447583}{\left(\bar{y}_i - \bar{y}\right)}^2 \approx \frac{\sigma^2}{n}. \]

  7. On the other hand, under the null, all the \(y_{ij} \sim \mathcal{N}\left(\mu, \sigma^2\right)\), so we would also know, \[ \frac{1}{N-a} \sum_{i, j}\color{#b090c2}{\left(y_{ij}-\bar{y}_{i}\right)}^{2} \approx \sigma^{2}, \] so under the null, \[ \frac{\frac{\color{#447583}{SS_{\text{treatments}}}}{a-1}}{\frac{\color{#b090c2}{SS_{\text{error}}}}{N-a}} \approx 1, \] since the factor of \(n\) in \(\color{#447583}{SS_{\text{treatments}}}\) cancels the \(\frac{\sigma^2}{n}\) above. Note that under the alternative, this ratio would be larger than 1.
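Continuing the same sketch, the two mean squares and their ratio can be computed directly from the sums of squares above:

# Mean squares and the resulting F statistic
ms_treat <- ss_treat / (a - 1)
ms_error <- ss_error / (a * n - a)  # N - a degrees of freedom
f_stat <- ms_treat / ms_error
f_stat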

  8. From our results in the probability review lectures, both \(\color{#447583}{SS_{\text{treatments}}}\) and \(\color{#b090c2}{SS_{\text{error}}}\), after scaling by \(\sigma^2\), are chi-squares, with \(a - 1\) and \(N - a\) d.f., respectively. It’s not obvious, but they’re also independent (this is called Cochran’s theorem). Therefore, the null reference distribution is an \(F\) distribution with \((a - 1, N - a)\) d.f.

Figure 3: Under the null, the scaling between the treatment and error sums of squares is known.
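The upper tail of this reference distribution gives the \(p\)-value; continuing the sketch,

# Probability that an F(a - 1, N - a) variable exceeds the observed ratio
pf(f_stat, df1 = a - 1, df2 = a * n - a, lower.tail = FALSE)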

  9. We usually call the numerator and denominator above \(\color{#447583}{MS_{\text{treatment}}}\) and \(\color{#b090c2}{MS_{E}}\).

Code Example

  10. Let’s read in an example dataset. We are looking at the etch rate of a machine under four different power settings. We want to know whether there is any difference in the rates, as a function of the power.
library(readr)

etch_rate <- read_csv("https://uwmadison.box.com/shared/static/vw3ldbgvgn7rupt4tz3ditl1mpupw44h.csv")
etch_rate$power <- as.factor(etch_rate$power) # consider as discrete levels
etch_rate
# A tibble: 20 × 3
   power replicate  rate
   <fct> <chr>     <dbl>
 1 160   Ob1         575
 2 160   Ob2         542
 3 160   Ob3         530
 4 160   Ob4         539
 5 160   Ob5         570
 6 180   Ob1         565
 7 180   Ob2         593
 8 180   Ob3         590
 9 180   Ob4         579
10 180   Ob5         610
11 200   Ob1         600
12 200   Ob2         651
13 200   Ob3         610
14 200   Ob4         637
15 200   Ob5         629
16 220   Ob1         725
17 220   Ob2         700
18 220   Ob3         715
19 220   Ob4         685
20 220   Ob5         710

Figure 4: Etch rate as a function of power.
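A figure like this can be drawn with ggplot2; here is one minimal version:

library(ggplot2)
ggplot(etch_rate) +
  geom_point(aes(power, rate))  # one point per replicate at each power setting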

  11. To compute all the quantities above, we can use the lm and anova functions. The F value column (called statistic in the tidied version below) is the F statistic, and the \(p\)-value is derived from the reference distribution of that statistic under the null.
fit <- lm(rate ~ power, data = etch_rate)
anova(fit)
Analysis of Variance Table

Response: rate
          Df Sum Sq Mean Sq F value    Pr(>F)    
power      3  66871 22290.2  66.797 2.883e-09 ***
Residuals 16   5339   333.7                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  12. To extract terms from this table, it is helpful to first convert it to a data.frame using tidy() from the broom package.
library(broom)
aov_table <- tidy(anova(fit))
aov_table$meansq
[1] 22290.18   333.70
aov_table
# A tibble: 2 × 6
  term         df  sumsq meansq statistic  p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>    <dbl>
1 power         3 66871. 22290.      66.8  2.88e-9
2 Residuals    16  5339.   334.      NA   NA      
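Individual entries can then be pulled out like columns of any other data.frame, for example,

aov_table$statistic[1]  # the F statistic for power
aov_table$p.value[1]    # its p-value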
  13. What if our data were arranged like the data.frame below? We can no longer use the lm function directly, because the outcome of interest isn’t isolated into a single column.
etch <- read_csv("https://uwmadison.box.com/shared/static/3ltmo89ea0xowsh1386x9fk58qc51ned.txt")
etch$Power <- as.factor(etch$Power)
etch
# A tibble: 4 × 6
  Power   Ob1   Ob2   Ob3   Ob4   Ob5
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 160     575   542   530   539   570
2 180     565   593   590   579   610
3 200     600   651   610   637   629
4 220     725   700   715   685   710
  14. To reorganize the data into an acceptable form, we can use the pivot_longer function from the tidyr package.
library(tidyr)
pivot_longer(etch, -Power, names_to = "replicate", values_to = "etch_rate")
# A tibble: 20 × 3
   Power replicate etch_rate
   <fct> <chr>         <dbl>
 1 160   Ob1             575
 2 160   Ob2             542
 3 160   Ob3             530
 4 160   Ob4             539
 5 160   Ob5             570
 6 180   Ob1             565
 7 180   Ob2             593
 8 180   Ob3             590
 9 180   Ob4             579
10 180   Ob5             610
11 200   Ob1             600
12 200   Ob2             651
13 200   Ob3             610
14 200   Ob4             637
15 200   Ob5             629
16 220   Ob1             725
17 220   Ob2             700
18 220   Ob3             715
19 220   Ob4             685
20 220   Ob5             710
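With the data in this long format, the same ANOVA as before can be recovered; etch_long and fit_long are names introduced here for illustration:

etch_long <- pivot_longer(etch, -Power, names_to = "replicate", values_to = "etch_rate")
fit_long <- lm(etch_rate ~ Power, data = etch_long)
anova(fit_long)  # reproduces the earlier ANOVA table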

  1. i.e., the opposite of “synthesis.”