How should we check the assumptions of the ANOVA model?
Recall the ANOVA model, \[ y_{i j}=\mu+\tau_{i}+\epsilon_{i j} \] with independent errors \(\epsilon_{ij} \sim \mathcal{N}\left(0, \sigma^2\right)\). There are a few ways that this model can fail,
To see if the model is okay, it will be helpful to define residuals, \[\begin{align*} e_{i j} &=y_{i j}-\hat{y}_{i j} \\ &=y_{i j}-\bar{y}_{i}. \end{align*}\] Residuals are our best guess of what the true random error \(\epsilon_{ij}\) is like.
resid
function; the example below continues the etch rate example.library(readr)
etch_rate <- read_csv("https://uwmadison.box.com/shared/static/vw3ldbgvgn7rupt4tz3ditl1mpupw44h.csv")
etch_rate$power <- as.factor(etch_rate$power) # consider as discrete levels
fit <- lm(rate ~ power, data = etch_rate)
resid(fit)
1 2 3 4 5 6 7 8 9 10 11
23.8 -9.2 -21.2 -12.2 18.8 -22.4 5.6 2.6 -8.4 22.6 -25.4
12 13 14 15 16 17 18 19 20
25.6 -15.4 11.6 3.6 18.0 -7.0 8.0 -22.0 3.0
etch_rate$residual <- resid(fit)
ggplot(etch_rate) +
geom_point(aes(power, residual))
We can use plots to check for independence. For example, if you plot the residuals over time and you see clear trends, then the errors are likely correlated over time.
It’s often useful to plot residuals against the fitted values. This can reveal nonconstant variance across the groups \(i\).
etch_rate$fitted <- predict(fit)
ggplot(etch_rate) +
geom_point(aes(fitted, residual))
What can you do if you detect nonconstant variance across groups? The most common fix is to use a variance stabilizing transformation. That is, apply some function \(f(x)\) to each data point and then perform the ANOVA.
There are various rules of thumb1, though the process is still somewhat informal,
x <- rpois(4000, 5)
x <- data.frame(x)
ggplot(x) +
geom_histogram(aes(x = x), binwidth = 0.5)
ggplot(x) +
geom_histogram(aes(x = sqrt(1 + x)), binwidth = 0.1)
It’s not at all obvious why any of these transformations are effective – they are typically derived in introductory mathematical statistics courses.↩︎