Model Checking

How should we check the assumptions of the ANOVA model?

Kris Sankaran
09-21-2021

Readings 3.4, Rmarkdown

  1. Recall the ANOVA model, \[ y_{i j}=\mu+\tau_{i}+\epsilon_{i j} \] with independent errors \(\epsilon_{ij} \sim \mathcal{N}\left(0, \sigma^2\right)\). There are a few ways that this model can fail,

    • There might be systematic variation besides the group deviations \(\tau_i\).
    • The errors might not be normally distributed.
    • The errors might not be independent.
    • The variance might not be the same in each group.
  2. To see if the model is okay, it will be helpful to define residuals, \[\begin{align*} e_{i j} &=y_{i j}-\hat{y}_{i j} \\ &=y_{i j}-\bar{y}_{i}. \end{align*}\] Residuals are our best guess of what the true random error \(\epsilon_{ij}\) is like.

Figure 1: The true error depends on the unknown means for each group; however, residuals can give a close approximation.

  3. We can extract the residuals using the resid function; the example below continues the etch rate example.
library(readr)
library(ggplot2) # used for the residual plots below
etch_rate <- read_csv("https://uwmadison.box.com/shared/static/vw3ldbgvgn7rupt4tz3ditl1mpupw44h.csv")
etch_rate$power <- as.factor(etch_rate$power) # consider as discrete levels
fit <- lm(rate ~ power, data = etch_rate)
resid(fit)
    1     2     3     4     5     6     7     8     9    10    11 
 23.8  -9.2 -21.2 -12.2  18.8 -22.4   5.6   2.6  -8.4  22.6 -25.4 
   12    13    14    15    16    17    18    19    20 
 25.6 -15.4  11.6   3.6  18.0  -7.0   8.0 -22.0   3.0 
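
To connect this output with the definition above, we can recompute the residuals by hand as deviations from each group mean; this is just a sanity check, using dplyr (assumed installed), and should match resid(fit).
library(dplyr)
etch_rate %>%
  group_by(power) %>%
  mutate(residual_manual = rate - mean(rate)) # same values as resid(fit)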

Normal Probability Plots

  4. We can’t check normality of \(\epsilon_{ij}\) directly, but we can check normality of the residuals \(e_{ij}\) using normal probability plots.
qqnorm(resid(fit))
qqline(resid(fit))
Figure 2: A qqplot for the etch rate data.

  5. Of the ways that the model can fail, nonnormality of the residuals is not the most severe, because you can often count on the central limit theorem to make the reference \(F\) distribution approximately correct anyway.
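
For a formal complement to the plot (this goes a bit beyond the reading), the shapiro.test function in base R can be applied to the residuals; small p-values point to departures from normality.
shapiro.test(resid(fit)) # Shapiro-Wilk test of the residuals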

Plotting Residuals

  6. To check for systematic variation beyond the \(\tau_i\)’s, try plotting residuals against measured variables. If you see “patterns,” those may correspond to missing terms in the model.
etch_rate$residual <- resid(fit)
ggplot(etch_rate) +
  geom_point(aes(power, residual))
Figure 3: There doesn’t seem to be a relationship between the measured variable and the residuals, so there is no reason to suspect missing terms in the model. (The third power level has slightly higher variance, but it’s barely noticeable.)

  7. We can use plots to check for independence. For example, if you plot the residuals over time and you see clear trends, then the errors are likely correlated over time; a sketch of this check is below.
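
The etch rate file carries no time stamp, so suppose, purely for illustration, that the rows appear in run order; the check would then look like,
etch_rate$run_order <- seq_len(nrow(etch_rate)) # hypothetical run order
ggplot(etch_rate) +
  geom_point(aes(run_order, residual))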

  8. It’s often useful to plot residuals against the fitted values. This can reveal nonconstant variance across the groups \(i\).

etch_rate$fitted <- predict(fit)
ggplot(etch_rate) +
  geom_point(aes(fitted, residual))
Figure 4: An example plotting residuals against the fitted values.

Testing Equality of Variances

  9. There are formal tests of whether the equal variance assumption of the ANOVA is valid (it’s very meta). The most common are,
    • Bartlett’s test
    • The Modified Levene test
    The main difference is that the Modified Levene test remains valid even when the errors are not normally distributed. You don’t need to memorize the test statistics, but know that they exist, and be able to interpret the associated computer output; both tests are run on the etch data below.
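
Bartlett’s test ships with base R, and the modified (median-centered) Levene test is available through the car package (assumed installed),
bartlett.test(rate ~ power, data = etch_rate)
car::leveneTest(rate ~ power, data = etch_rate, center = median) # modified Levene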

Transformations

  10. What can you do if you detect nonconstant variance across groups? The most common fix is to use a variance stabilizing transformation. That is, apply some function \(f(x)\) to each data point and then perform the ANOVA on the transformed data; the mechanics are sketched below.
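
For instance, refitting the model on a square-root transformed response looks like the following (illustrative only; nothing above suggests the etch data need this),
fit_sqrt <- lm(sqrt(rate) ~ power, data = etch_rate) # ANOVA on the transformed response
anova(fit_sqrt)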

  11. There are various rules of thumb¹, though the process is still somewhat informal,

    • \(f(x) = \sqrt{x}\) or \(f(x) = \sqrt{1 + x}\) if the data are counts
    • \(f(x) = \log x\) if the data seem lognormal
    • \(f(x) = \arcsin\left(\sqrt{x}\right)\) if the data are binomial-derived fractions
x <- rpois(4000, 5) # simulate counts from a Poisson distribution
x <- data.frame(x)

ggplot(x) +
  geom_histogram(aes(x = x), binwidth = 0.5)
Figure 5: A histogram of the simulated counts, before transformation.

ggplot(x) +
  geom_histogram(aes(x = sqrt(1 + x)), binwidth = 0.1)
Figure 6: The same counts after the \(\sqrt{1 + x}\) transformation, which brings them closer to normality.


¹ It’s not at all obvious why any of these transformations are effective – they are typically derived in introductory mathematical statistics courses.↩︎