How should we check the assumptions of the ANOVA model?
Recall the ANOVA model, \[ y_{i j}=\mu+\tau_{i}+\epsilon_{i j} \] with independent errors \(\epsilon_{ij} \sim \mathcal{N}\left(0, \sigma^2\right)\). There are a few ways this model can fail: the errors might not be normally distributed, they might not be independent, or their variance might not be constant across groups.
To see if the model is okay, it will be helpful to define residuals, \[\begin{align*} e_{i j} &=y_{i j}-\hat{y}_{i j} \\ &=y_{i j}-\bar{y}_{i}. \end{align*}\] Residuals are our best guess of what the true random error \(\epsilon_{ij}\) is like.
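As a quick check of this identity, the sketch below simulates a small two-group dataset (simulated data, not the etch rate study) and verifies that the residuals returned by `resid` from a fitted `lm` are exactly each observation minus its group mean.

```r
# Verify e_ij = y_ij - ybar_i on a small simulated two-group dataset
set.seed(1)
d <- data.frame(
  group = rep(c("a", "b"), each = 5),
  y = c(rnorm(5, mean = 10), rnorm(5, mean = 12))
)
fit <- lm(y ~ group, data = d)

# residuals by hand: subtract each group's mean from its observations
group_means <- ave(d$y, d$group)  # ave() repeats each group's mean per row
by_hand <- d$y - group_means

all.equal(unname(resid(fit)), by_hand)  # TRUE
```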
Figure 1: The true error depends on the unknown means for each group; however, residuals can give a close approximation.
In R, residuals can be extracted from a fitted model with the `resid` function; the example below continues the etch rate example.

```r
library(readr)
etch_rate <- read_csv("https://uwmadison.box.com/shared/static/vw3ldbgvgn7rupt4tz3ditl1mpupw44h.csv")
etch_rate$power <- as.factor(etch_rate$power) # consider power as discrete levels
fit <- lm(rate ~ power, data = etch_rate)
resid(fit)
```
```
    1     2     3     4     5     6     7     8     9    10    11
 23.8  -9.2 -21.2 -12.2  18.8 -22.4   5.6   2.6  -8.4  22.6 -25.4
   12    13    14    15    16    17    18    19    20
 25.6 -15.4  11.6   3.6  18.0  -7.0   8.0 -22.0   3.0
```
```r
library(ggplot2)
etch_rate$residual <- resid(fit)
ggplot(etch_rate) +
  geom_point(aes(power, residual))
```
Figure 3: There doesn’t seem to be a relationship between the measured variable and the residuals, so there is no reason to suspect missing terms in the model. (The third power level has slightly higher variance, but it’s barely noticeable.)
We can use plots to check for independence. For example, if you plot the residuals over time and you see clear trends, then the errors are likely correlated over time.
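The etch rate data do not record a run order, so the sketch below uses simulated residuals with a deliberate upward drift, to show what a violation of independence would look like in this kind of plot.

```r
# Sketch: residuals plotted in run order. These are simulated with an
# artificial drift, so the plot shows a clear trend -- evidence that the
# errors would be correlated over time in a real experiment.
library(ggplot2)
set.seed(2)
run_df <- data.frame(
  run = 1:40,
  residual = rnorm(40, sd = 5) + 0.5 * (1:40)  # drift term induces the trend
)
ggplot(run_df) +
  geom_point(aes(run, residual)) +
  geom_line(aes(run, residual), alpha = 0.3)
```

With a real dataset, you would replace `run` with the recorded time or run-order column.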
It’s often useful to plot residuals against the fitted values. This can reveal nonconstant variance across the groups \(i\).
```r
etch_rate$fitted <- predict(fit)
ggplot(etch_rate) +
  geom_point(aes(fitted, residual))
```
Figure 4: An example plotting residuals against the fitted values.
What can you do if you detect nonconstant variance across groups? The most common fix is to use a variance stabilizing transformation. That is, apply some function \(f(x)\) to each data point and then perform the ANOVA.
There are various rules of thumb¹, though the process is still somewhat informal. For example, a square root transformation is often used for counts, and a log transformation when the standard deviation grows in proportion to the mean.
```r
x <- data.frame(x = rpois(4000, 5))
ggplot(x) +
  geom_histogram(aes(x = x), binwidth = 0.5)
```
Figure 5: Simulated Poisson counts before any transformation.
```r
ggplot(x) +
  geom_histogram(aes(x = sqrt(1 + x)), binwidth = 0.1)
```
Figure 6: The same counts after the transformation \(\sqrt{1 + x}\), which brings them closer to normality.
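The standard argument for why the square root works on counts is a first-order Taylor ("delta method") approximation; the following is a brief sketch of that reasoning. For a smooth \(f\) and a random variable \(X\) with mean \(\mu\), \[ \operatorname{Var}\big(f(X)\big) \approx f'(\mu)^2 \operatorname{Var}(X). \] For Poisson-like counts, \(\operatorname{Var}(X) = \mu\), and taking \(f(x) = \sqrt{x}\) gives \(f'(\mu) = \frac{1}{2\sqrt{\mu}}\), so \[ \operatorname{Var}\big(\sqrt{X}\big) \approx \frac{\mu}{4\mu} = \frac{1}{4}, \] which no longer depends on \(\mu\) – the variance has been stabilized.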
It’s not at all obvious why any of these transformations are effective – they are typically derived in introductory mathematical statistics courses.↩︎