The ANOVA model and sum-of-squares decomposition
The word “analysis” in ANOVA is used in the classical sense of breaking something into its parts¹. ANOVA breaks the observed variation into distinct components, \[ \sum_{i,j} \color{#ff8200}{\left(y_{ij} - \bar{y}\right)}^2 = n\sum_{i} \color{#447583}{\left(\bar{y}_i - \bar{y}\right)}^2 + \sum_{i,j} \color{#b090c2}{\left(y_{ij} - \bar{y}_{i}\right)}^2 \] which is abbreviated as
\[ \color{#ff8200}{SS_{\text{total}}} = \color{#447583}{SS_{\text{treatments}}} + \color{#b090c2}{SS_{\text{error}}}. \] This is usually called the “ANOVA identity.”
If any of the groups had a mean that was different from the global mean, then we’d expect the treatment term \(\color{#447583}{SS_{\text{treatments}}}\) to be larger than it would otherwise be. How large is large enough to reject the null hypothesis that all the group means are equal?
Since the variance within each group is \(\sigma^2\), the variance of each group mean \(\bar{y}_i\) is \(\frac{\sigma^2}{n}\). The blue term looks like how we would usually estimate the variance of the \(\bar{y}_i\), i.e.,
\[ \frac{1}{a - 1}\sum_{i}\color{#447583}{\left(\bar{y}_i - \bar{y}\right)}^2 \approx \frac{\sigma^2}{n}, \] so multiplying through by \(n\) gives \(\frac{\color{#447583}{SS_{\text{treatments}}}}{a-1} \approx \sigma^2\). On the other hand, under the null, all the \(y_{ij} \sim \mathcal{N}\left(\mu, \sigma^2\right)\), so we would also know \[ \frac{1}{N-a} \sum_{i, j}\color{#b090c2}{\left(y_{ij}-\bar{y}_{i}\right)}^{2} \approx \sigma^{2}. \] Therefore, under the null, \[ \frac{\frac{\color{#447583}{SS_{\text{treatments}}}}{a-1}}{\color{#b090c2}{\frac{SS_{\text{error}}}{N-a}}} \approx 1, \] while under the alternative, this ratio would be larger than 1.
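To see this ratio behaving as expected, here is a minimal simulation sketch (not from the original text): we draw \(a\) groups of \(n\) observations from the same normal distribution, so the null holds, and compute the ratio by hand. The group sizes and the mean and standard deviation are arbitrary choices for illustration.
set.seed(1)
a <- 4; n <- 5
sim <- data.frame(
  group = factor(rep(seq_len(a), each = n)),
  y = rnorm(a * n, mean = 600, sd = 20)          # every group shares the same mean: the null is true
)
group_means <- tapply(sim$y, sim$group, mean)
grand_mean <- mean(sim$y)
ss_treatment <- n * sum((group_means - grand_mean) ^ 2)
ss_error <- sum((sim$y - group_means[as.integer(sim$group)]) ^ 2)
(ss_treatment / (a - 1)) / (ss_error / (a * n - a))   # fluctuates around 1 under the null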
library(readr)
# etch rate measurements at four power settings, with five replicates each
etch_rate <- read_csv("https://uwmadison.box.com/shared/static/vw3ldbgvgn7rupt4tz3ditl1mpupw44h.csv")
etch_rate$power <- as.factor(etch_rate$power) # consider power as discrete levels
etch_rate
# A tibble: 20 × 3
power replicate rate
<fct> <chr> <dbl>
1 160 Ob1 575
2 160 Ob2 542
3 160 Ob3 530
4 160 Ob4 539
5 160 Ob5 570
6 180 Ob1 565
7 180 Ob2 593
8 180 Ob3 590
9 180 Ob4 579
10 180 Ob5 610
11 200 Ob1 600
12 200 Ob2 651
13 200 Ob3 610
14 200 Ob4 637
15 200 Ob5 629
16 220 Ob1 725
17 220 Ob2 700
18 220 Ob3 715
19 220 Ob4 685
20 220 Ob5 710
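With the data loaded, we can compute each piece of the decomposition by hand. This is a sketch, not code from the original text; the sums of squares it produces should match the ANOVA table below.
group_means <- tapply(etch_rate$rate, etch_rate$power, mean)   # one mean per power level
grand_mean <- mean(etch_rate$rate)
a <- 4; n <- 5; N <- a * n
ss_treatment <- n * sum((group_means - grand_mean) ^ 2)
ss_error <- sum((etch_rate$rate - group_means[as.integer(etch_rate$power)]) ^ 2)
ss_total <- sum((etch_rate$rate - grand_mean) ^ 2)
c(ss_total, ss_treatment + ss_error)                 # the ANOVA identity: these two agree
(ss_treatment / (a - 1)) / (ss_error / (N - a))      # the F ratio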
All of these quantities are assembled into the standard ANOVA table by the lm and anova functions. The statistic column (labeled F value in the printout below) is the F statistic, and the \(p\)-value is derived from the reference distribution of that statistic under the null.
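The source doesn’t show the call that produced the printout below; a minimal sketch that would reproduce it (the object name fit is an assumption, reused in the snippets that follow) is:
fit <- lm(rate ~ power, data = etch_rate)   # one mean per power level
anova(fit)                                  # sum-of-squares decomposition and F test
Analysis of Variance Table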
Response: rate
Df Sum Sq Mean Sq F value Pr(>F)
power 3 66871 22290.2 66.797 2.883e-09 ***
Residuals 16 5339 333.7
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The same table is easier to work with programmatically as a data.frame, which we can get using tidy() in the broom package.
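The exact call isn’t shown in the source; a plausible sketch is below. The object name aov_table matches the printout that follows, and extracting the meansq column is an assumption that matches the printed vector of mean squares.
library(broom)
aov_table <- tidy(anova(fit))   # the ANOVA table as a tibble
aov_table$meansq                # the two mean squares from the table
[1] 22290.18 333.70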
aov_table
# A tibble: 2 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 power 3 66871. 22290. 66.8 2.88e-9
2 Residuals 16 5339. 334. NA NA
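As a check on where the \(p\)-value comes from, we can evaluate the upper tail of the \(F_{3, 16}\) reference distribution at the observed statistic. This is a sketch, not from the original text; the degrees of freedom are read off the table above.
pf(aov_table$statistic[1], df1 = 3, df2 = 16, lower.tail = FALSE)   # matches p.value above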
The same measurements can also arrive in a wide format, with one column per replicate. In that form the data couldn’t be used directly in the lm function, because the outcome of interest isn’t isolated into a single column.
etch <- read_csv("https://uwmadison.box.com/shared/static/3ltmo89ea0xowsh1386x9fk58qc51ned.txt")
etch$Power <- as.factor(etch$Power)
etch
# A tibble: 4 × 6
Power Ob1 Ob2 Ob3 Ob4 Ob5
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 160 575 542 530 539 570
2 180 565 593 590 579 610
3 200 600 651 610 637 629
4 220 725 700 715 685 710
We can reshape the data into the long format expected by lm using the pivot_longer function from the tidyr package.
library(tidyr)
pivot_longer(etch, -Power, names_to = "replicate", values_to = "etch_rate")
# A tibble: 20 × 3
Power replicate etch_rate
<fct> <chr> <dbl>
1 160 Ob1 575
2 160 Ob2 542
3 160 Ob3 530
4 160 Ob4 539
5 160 Ob5 570
6 180 Ob1 565
7 180 Ob2 593
8 180 Ob3 590
9 180 Ob4 579
10 180 Ob5 610
11 200 Ob1 600
12 200 Ob2 651
13 200 Ob3 610
14 200 Ob4 637
15 200 Ob5 629
16 220 Ob1 725
17 220 Ob2 700
18 220 Ob3 715
19 220 Ob4 685
20 220 Ob5 710
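Once the outcome sits in a single column, the model can be fit exactly as before. A sketch (not in the original; the column names come from the pivot_longer call above, and the object names are introduced here for illustration):
etch_long <- pivot_longer(etch, -Power, names_to = "replicate", values_to = "etch_rate")
fit_long <- lm(etch_rate ~ Power, data = etch_long)   # same one-way model as before
anova(fit_long)                                       # reproduces the ANOVA table above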
¹ i.e., the opposite of “synthesis.”