Dealing with batch effects using a generalization of pairing.
How can we generalize the idea of pairing to the case that we have more than two treatments to compare? Suppose that the nuisance factor is known and controllable. For simplicity, also assume that the blocks have the same number \(a\) of samples as we have treatments (\(a=2 \to\) pairing)
Example [page 120]. A medical device manufacturer creates vascular grafts, and wants to see the effect of manufacturing pressure on defect rate. But the resin used to make the graft arrives in batches, and may itself contribute to defects.
Idea: Randomly assign each of the \(a\) treatments to the units within each block. This is called a Randomized Complete Block Design (RCBD).
But now we also have * \(\beta_{j}\), the effect of the \(j^{th}\) block
We’re interested in whether any treatments have an effect, \[\begin{align*} H_0 &: \tau_1 = \dots = \tau_a = 0 \\ H_{1} &: \tau_{i} \neq 0 \text{ for at least one } i \end{align*}\]
There’s a useful decomposition of the total variance that accounts for block-to-block variation, \[\begin{align*} \sum_{i = 1}^{a}\sum_{j = 1}^{b}\left(y_{ij} - \bar{y}_{\cdot\cdot}\right)^2 = &b\sum_{i = 1}^{a} \left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2 + \\ & a \sum_{j = 1}^{b}\left(\bar{y}_{\cdot j} - \bar{y}_{\cdot\cdot}\right)^2 + \\ &\sum_{i = 1}^{a}\sum_{j = 1}^{b} \left(y_{ij} - \bar{y}_{i\cdot} - \bar{y}_{\cdot j} + \bar{y}_{\cdot\cdot}\right)^2 \end{align*}\] We’ll abbreviate this as \(SS_{\text{Total}} = SS_{\text{Treatment}} + SS_{\text{Block}} + SS_{E}\).
The corresponding mean squares are,
\[ \frac{MS_{\text{Treatment}}}{MS_{E}} \sim F\left(a - 1, \left(a - 1\right)\left(b - 1\right)\right) \] so we can reject the null hypothesis if this statistic is larger than the \(1 - \alpha\) quantile of an \(F\) distribution with \((a - 1, (a - 1)(b - 1))\) d.f.
library(readr)
library(tidyr)
library(ggplot2)
graft <- read_table2("https://uwmadison.box.com/shared/static/0ciblk4z2f3k6zizbj4plg3q33w1d0n6.txt") %>%
pivot_longer(-Pressure, names_to = "batch", values_to = "yield")
graft$Pressure <- as.factor(graft$Pressure)
ggplot(graft) +
geom_point(aes(x = batch, y = yield, col = Pressure)) +
scale_color_brewer(palette = "Set2") # custom color scheme
lm
function to fit the RCBD model. As before, the response goes on the left hand side, but now the right hand side includes both the variable of interest (Pressure
) and an indicator of which batch each sample came from (batch
). We can build an ANOVA table for the expanded table using the same aov
table as before. Df Sum Sq Mean Sq F value Pr(>F)
Pressure 3 178.2 59.39 8.107 0.00192 **
batch 5 192.2 38.45 5.249 0.00553 **
Residuals 15 109.9 7.33
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It appears that there is a significant batch effect. Controlling for this effect, we are able to detect a significant relationship between pressure and yield. Note that the results are identical to those in Figure 4.2 of the textbook.
If we hadn’t controlled for the batch effect, we would have still seen this relationship, but we would have lower power (the \(p\)-value is not as low here as it had been before).
Df Sum Sq Mean Sq F value Pr(>F)
Pressure 3 178.2 59.39 3.931 0.0234 *
Residuals 20 302.1 15.11
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1