The ANOVA model and sum-of-squares decomposition

 
Figure 1: The underlying distributions in the ANOVA model. Under the null, all the distributions have the same vertical offset.
The word “analysis” in ANOVA is used in the classical sense of to break something into its parts1. ANOVA breaks the observed variation into distinct components, \[ \sum_{ij} \color{#ff8200}{\left(y_{ij} - \bar{y}\right)}^2 = n\sum_{i} \color{#447583}{\left(\bar{y}_i - \bar{y}\right)}^2 + \sum_{i,j} \color{#b090c2}{\left(y_{ij} - \bar{y}_{i}\right)}^2 \] which is abbreviated as
\[ \color{#ff8200}{SS_{\text{total}}} = \color{#447583}{SS_{\text{treatments}}} + \color{#b090c2}{SS_{\text{error}}}. \] This is usually called the “ANOVA identity.”


 
Figure 2: Visual representation of the three terms in the ANOVA identity.
If any of the groups had a mean that was different from the global mean, then we’d expect the term to be larger than it would otherwise be. How large is large enough to reject?
Since the variance within each group is \(\sigma^2\), the variance of each \(\bar{y}_i\) is \(\frac{\sigma^2}{n}\). The blue term looks like how we would usually estimate the variances of the \(y_i\), i.e.,
\[ \frac{1}{a - 1}\sum_{i}\color{#447583}{\left(\bar{y}_i - \bar{y}\right)}^2 \approx \frac{\sigma^2}{n} \] 7. On the other hand, under the null, all the \(y_{ij} \sim \mathcal{N}\left(\mu, \sigma^2\right)\), so we would also know, \[ \frac{1}{N-a} \sum_{i, j}\color{#b090c2}{\left(y_{i, j}-\bar{y}_{i}\right)}^{2} \approx \sigma^{2}, \] so under the null, \[ \frac{\frac{\color{#447583}{SS_{\text {treatment }}}}{a-1}}{\color{#b090c2}{\frac{SS_{\text {error }}}{N-a}}} \approx 1. \] Note that under the alternative, it would be larger than 1.
 
Figure 3: Under the null, the scaling between the treatment and error sums of squares is known.
library(readr)
etch_rate <- read_csv("https://uwmadison.box.com/shared/static/vw3ldbgvgn7rupt4tz3ditl1mpupw44h.csv")
etch_rate$power <- as.factor(etch_rate$power) # consider as discrete levels
etch_rate
# A tibble: 20 × 3
   power replicate  rate
   <fct> <chr>     <dbl>
 1 160   Ob1         575
 2 160   Ob2         542
 3 160   Ob3         530
 4 160   Ob4         539
 5 160   Ob5         570
 6 180   Ob1         565
 7 180   Ob2         593
 8 180   Ob3         590
 9 180   Ob4         579
10 180   Ob5         610
11 200   Ob1         600
12 200   Ob2         651
13 200   Ob3         610
14 200   Ob4         637
15 200   Ob5         629
16 220   Ob1         725
17 220   Ob2         700
18 220   Ob3         715
19 220   Ob4         685
20 220   Ob5         710 
Figure 4: Etch rate as a function of power.
lm and anova functions. The column statistic is the F statistic, and the \(p\)-value is derived from the reference distribution of that statistic under the null.Analysis of Variance Table
Response: rate
          Df Sum Sq Mean Sq F value    Pr(>F)    
power      3  66871 22290.2  66.797 2.883e-09 ***
Residuals 16   5339   333.7                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1data.frame using tidy() in the broom package.[1] 22290.18   333.70aov_table
# A tibble: 2 × 6
  term         df  sumsq meansq statistic  p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>    <dbl>
1 power         3 66871. 22290.      66.8  2.88e-9
2 Residuals    16  5339.   334.      NA   NA      lm function, because the outcome of interest isn’t isolated into a single column.etch <- read_csv("https://uwmadison.box.com/shared/static/3ltmo89ea0xowsh1386x9fk58qc51ned.txt")
etch$Power <- as.factor(etch$Power)
etch
# A tibble: 4 × 6
  Power   Ob1   Ob2   Ob3   Ob4   Ob5
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 160     575   542   530   539   570
2 180     565   593   590   579   610
3 200     600   651   610   637   629
4 220     725   700   715   685   710pivot_longer function from the tidyr package.library(tidyr)
pivot_longer(etch, -Power, names_to = "replicate", values_to = "etch_rate")
# A tibble: 20 × 3
   Power replicate etch_rate
   <fct> <chr>         <dbl>
 1 160   Ob1             575
 2 160   Ob2             542
 3 160   Ob3             530
 4 160   Ob4             539
 5 160   Ob5             570
 6 180   Ob1             565
 7 180   Ob2             593
 8 180   Ob3             590
 9 180   Ob4             579
10 180   Ob5             610
11 200   Ob1             600
12 200   Ob2             651
13 200   Ob3             610
14 200   Ob4             637
15 200   Ob5             629
16 220   Ob1             725
17 220   Ob2             700
18 220   Ob3             715
19 220   Ob4             685
20 220   Ob5             710i.e. the opposite of “synthesis.”↩︎