Managing Batch Effects in Biological Data

## Managing Batch Effects in Biological Data

<div id="subtitle">
Kris Sankaran <br/>
23 | October | 2024 <br/>
Lab: <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/>
</div>

<div id="subtitle_right">
BME 780: Methods in Quantitative Biology<br/>
Slides: <a href="https://go.wisc.edu/1lv9js">go.wisc.edu/1lv9js</a><br/>
</div>

---

### Learning Outcomes

1. **Imagine**: Why might batch effects appear in your data?

1. **Derive**: Interpret algebraic expressions used to represent batch effect correction methods.

1. **Check**: Reason about checks that can give more confidence in a batch effect correction method's output.

1. **Improve**: Explain the potential for replication and controls to improve batch effect correction methods

---

---

### Cancer Microbiome Controversy [1]

In June 2024, Nature retracted a paper [2] the claimed
identify microbiome signatures of cancer. This came after one year's worth of
debate [3; 4; 5]
about the data analysis.

---

### Cancer Microbiome Controversy [1]

The "disease signature" was an artifact resulting from the use of a batch effect
correction method.  Before we can understand the nuances of the story, we need
to learn about batch effects and correction methods.

---

### Sources of Technical Variation

There are many points at which technical variation arises. For example, for single-cell sequencing [6]:
.pull-left[
* Sample Collection
* Sample Storage
* Library Preparation
* cDNA Synthesis
]

Differences at any step can lead to systematic differences between data
collected across experimental runs ("batches"). This is especially problematic
for studies with large sample sizes or multiple study sites.

---

### Measurement vs. Reality [7]

.pull-left[
1. In the same way that a photo can be blurry or obscured by glare, the
measurements we obtain from sequencing can be obscured by technical factors.

1. Any data are only an imperfect snapshot of the biological scene we're seeking
to understand. 
]

.pull-right[
    <img src="figures/kensington.jpg" width=320/>
<p class="caption">
The glare is a technical artifact of the measuring device.
</p>
]

---

### Measurement vs. Reality [7]

.pull-left[
1. In the same way that a photo can be blurry or obscured by glare, the
measurements we obtain from sequencing can be obscured by technical factors.

1. Any data are only an imperfect snapshot of the biological scene we're seeking
to understand. 
]

.pull-right[
    <img src="figures/goodsell_phagophore.jpg" width=300/>
<p class="caption">
It is hard to precisely measure molecular systems. Artwork from [8].
</p>
]

---

### Improving Discovery

1. To discover more interesting patterns, we need to either remove these effects
or account for them in the analysis.
]

.pull-right[
<img src="figures/batch_effects_leek.png" width=700/>
<p class="caption">
Read coverage across samples from the HapMap project, discussed in [9]. Horizontal blocks have the same sequencing date.
</p>
]

---

### Improving Discovery

1. To discover more interesting patterns, we need to either remove these effects
or account for them in the analysis.
]

.pull-right[
<img src="figures/batch_effects_eva.png" width=700/>
<p class="caption">
Batch effects across collection dates from a microbiome 
study, discussed in [10].
</p>
]

---

### Improving Credibility

> If all goes well, then the article... has shown mammal kidney structure; if all goes badly, it shrinks to three hamsters in one laboratory in 1984.
>
> -- Bruno Latour in [11]

To dispute a scientific fact, a dissenter can always argue that the
conditions that led to its production are not generalizable. To guard against
this, we need to understand batch-to-batch variation.

---

---

### General Strategies

1. Correct: Subtract the batch effects before all other analysis.
1. Account: Directly model batch-to-batch variation in the analysis.

|  | Pros | Cons |
|--|---|---|
| Correct | Generality. Can be reused easily. | Can introduce its own artifacts. |
| Account | Cohesion. Only one method needed for everything.  | Time consuming. Bespoke development for each context. |

We will review correction methods.

---

### Correction Methods

.pull-left[
Any method must define,
1. Transformations: How might batch effects impact measurement?
1. Objective: How can we find the most plausible transformation for our data?
]

The best choice can depend on properties of the data and the experimental
design, which is why there are so many available methods.

---

### Exemplar Methods

Today and next week, we'll consider:

1. ComBat [12] (2007, 7.6K citations). Linear Model.
1. Harmony [13] (2019, 4.8K citations). Clustering + Weighted Least Squares.
1. CellAnova [14] (preprint, 2023). Clustering + Linear Model.

---

### ComBat - Setting

ComBat was proposed to remove batch effects in microarray data with small sample
sizes, `\(n \approx 5 - 20\)` per batch, and it has inspired many variants.

1. `\(\*y_i \in \reals^{G}\)`: Sample `\(i\)`'s expression levels across `\(G\)` genes.
1. `\(\*x_i \in \reals^{D}\)`: Potentially relevant biological factors.
1. `\(m\left(i\right) \in \left[1, \dots, M \right]\)`: Which of the `\(M\)` batches does sample `\(i\)` belong to?

---

### ComBat - Model

Sample `\(i\)` is assumed to be drawn from:

`\begin{align*}
\*y_{i} = \mu + \*W \*x_{i} + \*b_{m(i)} + \Lambda_{m(i)}\*\epsilon_i
\end{align*}`

where `\(\Lambda_{m} = \diag{\delta_{mg}}\)` rescales the noise `\(\epsilon_i \sim \mathcal{N}\left(0, \diag{\sigma_g^2}\right)\)`.

1. The parameters `\(\*W\)` capture shared biological effects.
1. Batch effects cause systematic location-scale shifts.
1. Given `\(\*b_m\)` and `\(\*\Lambda_{m}\)`, we can remove their effect from
the observed data.
---

### ComBat - Estimation

Rather than estimating each gene `\(g\)` separately, ComBat uses a hierarchical
model to share information across all genes. This improves stability in the
corrections for genes with very high noise levels.

.center[
<img src="figures/combat_fig3.gif" width=330/>
]
<p class="caption">
By fitting the model simultaneously across genes, the estimates are "shrunk"
towards the mean. 
</p>

---

### Harmony - Setting

Harmony was proposed to integrate single-cell data across experiments. Since
these data have much larger sample sizes, it's possible to learn a more complex
transformation.

.center[
<img src="figures/harmony_integration.png" width=840/>
]
<p class="caption">
First cluster to find cell types and then make batches overlap within each cluster.
</p>

---

### Harmony - Clustering

`\begin{align*}
\min_{\*\mu_k \in \reals^{G}, \*r_i \in \Delta^{K}} \sum_{i = 1}^{N} \left[\sum_{k = 1}^{K} r_{ik} \|\*y_i - \*\mu_{k}\|_{2}^{2} - \lambda H\left(\*r_{i}\right)\right]
\end{align*}`
where `\(H\left(\*r_i\right)\)` is the entropy of sample `\(i\)`'s responsibilities.
]

---

### Harmony - Clustering

They use a chi-square statistic to encourage clusters to contain an evan mix
across batches (strong associations lead to large values of the statistic).

`\begin{align*}
\min_{\*\mu_k \in \reals^{G}, \*r_i \in \Delta^{K}} \sum_{i = 1}^{N} &\left[\sum_{k = 1}^{K}r_{ik}\|\*y_i - \*\mu_{k}\|_{2}^{2} - \lambda H\left(\*r_{i}\right)\right] +\chi^2\left(\text{batch}, \text{cluster}\right)
\end{align*}`

---

### Harmony - Correction

They use weighted least squares to estimate batch means for each cluster,
`\begin{align*}
\sum_{k} r_{ik} \|\*y_i - \mu_k - \*b^k_{m\left(i\right)}\|_{2}^2
\end{align*}`
The corrected data are `\(\tilde{\*y}_i = \sum_{k}r_{ik}\mu_k\)`.

---

### Challenges

Batch effect correction can introduce subtle problems of its own.

**Hypothesis Testing**: When the batches are highly imbalanced, correction can
distort the null distribution of test statistics. This invalidates the usual
`\(p\)`-values [15].

---

### Challenges

Batch effect correction can introduce subtle problems of its own.

.pull-three-quarters-left[
**Alignability**: Batch effect correction methods will attempt to match batches
regardless of whether there is actually any shared biology [16].
]

---

### Challenges

Batch effect correction can introduce subtle problems of its own.

**Supervision**: Methods that use a biological group label can introduce spurious
associations with that label [17].

---

---

### Sensitivity and Simulation

1. Run the analysis separately across batches. Many of the results should
overlap if the biology is truly shared across batches.

1. Use a simulation to better understand the pipeline. There are many packages
for learning a simulator based on an initial dataset [18].

---

### Simulation and Supervised Normalization

Gerry Tonkin-Hill has an excellent re-analysis [17] of the data from [2] which sheds light what were likely the source of the phantom signals. The first part is a simulation.

---

### Simulation and Supervised Normalization

This setting is balanced -- each condition is equally likely across batches. In
this case, SVN batch effect correction [19]
works well.

---

### Simulation and Supervised Normalization

But what happens if there is imbalance?

---

### Simulation and Supervised Normalization

In this case, the SVN correction introduces an artificial difference.

This should cause concern about the original analysis: Hospitals specialize in
cancer types. Then again, this simulation is quite unrealistic.

---

### Experimental Design

1. To better account for batch effects, we should think carefully about
replication and controls [20; 21; 22].

- Replication: Generate multiple measurements of the same sample.
  - Controls: Samples or features where the signal is known to be absent.

1. Any variation that is only present after a step of data generation must be
due to the measurement mechanism, rather than the true biology.

---

### TCGA Data Revisited

1. [2]'s data had been sequenced at several centers, and
they used SVN to remove center-level batch effects.

1. After [3] raised concerns about the paper, they
decided to rerun the analysis restricted to data sequenced at Harvard Medical.

---

### TCGA Data Revisited

However, even though the data were sequenced at the same center, they were
collected by different hospitals.

---

### TCGA Data Revisited

If you remove the hospital-level batch effect, the signal more or less vanishes.
Either applying or omitting a batch effect correction method can lead to
spurious effects! It is important to be skeptical and carefully consider the
experimental design.

---

### CellAnova

1. CellAnova [14] takes this thinking a step further by
formalizing an approach to defining and using control samples.

1. Across all batches, control samples are assumed to have the same underlying
biology, and this gives confidence in the resulting correction.

---

### CellAnova

The CellAnova paper uses notation that stacks the expressions like those we saw
in ComBat.

`\begin{align*}
\*y_{i} = \mu + \*W \*x_{i} + \*b_{m(i)} + \Lambda_{m(i)}\*\epsilon_i
\end{align*}`

vs.

`\begin{align*}
\*Y = \*M + \*X \*W^\top + \*B \*V^\top
\end{align*}`

---

### Conclusion

1. **Garbage In, Garbage Out**: If the main source of variation in the data comes
from batch effects, then even the most sophisticated methodology will fail to
discover interesting biology.

1. **Approaches**: Correction methods estimate batch-specific transformations of
shared biology. These transformations are then reversed to define the
correction.

1. **Checks**: It's important not to use batch effect correction methods
thoughtlessly.  Consider applying simulation and leveraging the experimental
design.

---

### References

[1] C. Offord. "Journal retracts influential cancer microbiome paper". In: _AAAS Articles DO Group_ (Jun. 2024). DOI:
[10.1126/science.z2xkuua](https://doi.org/10.1126%2Fscience.z2xkuua). URL: [http://dx.doi.org/10.1126/science.z2xkuua](http://dx.doi.org/10.1126/science.z2xkuua).

[2] G. D. Poore et al. "RETRACTED ARTICLE: Microbiome analyses of blood and tissues suggest cancer diagnostic approach". In: _Nature_ 579.7800 (Mar. 2020), p. 567–574. ISSN:
1476-4687. DOI: [10.1038/s41586-020-2095-1](https://doi.org/10.1038%2Fs41586-020-2095-1). URL:
[http://dx.doi.org/10.1038/s41586-020-2095-1](http://dx.doi.org/10.1038/s41586-020-2095-1).

[3] A. Gihawi et al. "Major data analysis errors invalidate cancer microbiome findings". In: _mBio_ 14.5 (Oct. 2023). Ed. by I. B. Zhulin. ISSN: 2150-7511. DOI:
[10.1128/mbio.01607-23](https://doi.org/10.1128%2Fmbio.01607-23). URL: [http://dx.doi.org/10.1128/mbio.01607-23](http://dx.doi.org/10.1128/mbio.01607-23).

[4] G. D. Sepich-Poore et al. "Reply to: Caution Regarding the Specificities of Pan-Cancer Microbial Structure".  (Feb. 2023). DOI:
[10.1101/2023.02.10.528049](https://doi.org/10.1101%2F2023.02.10.528049). URL: [http://dx.doi.org/10.1101/2023.02.10.528049](http://dx.doi.org/10.1101/2023.02.10.528049).

[5] G. D. Sepich-Poore et al. "Robustness of cancer microbiome signals over a broad range of methodological variation". In: _Oncogene_ 43.15 (Feb. 2024), p. 1127–1148. ISSN:
1476-5594. DOI: [10.1038/s41388-024-02974-w](https://doi.org/10.1038%2Fs41388-024-02974-w). URL:
[http://dx.doi.org/10.1038/s41388-024-02974-w](http://dx.doi.org/10.1038/s41388-024-02974-w).

[6] R. Jiang et al. "Statistics or biology: the zero-inflation controversy about scRNA-seq data". In: _Genome biology_ 23.1 (2022), p. 31.

[7] A. Sarkar et al. "Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis". In: _Nature genetics_ 53.6 (2021), pp.
770-777.

[8] D. S. Goodsell et al. "From Atoms to Cells: Using Mesoscale Landscapes to Construct Visual Narratives". In: _Journal of Molecular Biology_ 430.21 (Oct. 2018), p.
3954–3968. ISSN: 0022-2836. DOI: [10.1016/j.jmb.2018.06.009](https://doi.org/10.1016%2Fj.jmb.2018.06.009). URL:
[http://dx.doi.org/10.1016/j.jmb.2018.06.009](http://dx.doi.org/10.1016/j.jmb.2018.06.009).

[9] J. T. Leek et al. "Tackling the widespread and critical impact of batch effects in high-throughput data". In: _Nature Reviews Genetics_ 11.10 (Sep. 2010), p. 733–739.
ISSN: 1471-0064. DOI: [10.1038/nrg2825](https://doi.org/10.1038%2Fnrg2825). URL: [http://dx.doi.org/10.1038/nrg2825](http://dx.doi.org/10.1038/nrg2825).

[10] Y. Wang et al. "Managing batch effects in microbiome data". In: _Briefings in Bioinformatics_ 21.6 (Nov. 2019), p. 1954–1970. ISSN: 1477-4054. DOI:
[10.1093/bib/bbz105](https://doi.org/10.1093%2Fbib%2Fbbz105). URL: [http://dx.doi.org/10.1093/bib/bbz105](http://dx.doi.org/10.1093/bib/bbz105).

[11] B. Latour. "Science in action: How to follow scientists and engineers through society". In: _Harvard UP_ (1987).

[12] W. E. Johnson et al. "Adjusting batch effects in microarray expression data using empirical Bayes methods". In: _Biostatistics_ 8.1 (Apr. 2006), p. 118–127. ISSN:
1465-4644. DOI: [10.1093/biostatistics/kxj037](https://doi.org/10.1093%2Fbiostatistics%2Fkxj037). URL:
[http://dx.doi.org/10.1093/biostatistics/kxj037](http://dx.doi.org/10.1093/biostatistics/kxj037).

[13] I. Korsunsky et al. "Fast, sensitive and accurate integration of single-cell data with Harmony". In: _Nature Methods_ 16.12 (Nov. 2019), p. 1289–1296. ISSN: 1548-7105.
DOI: [10.1038/s41592-019-0619-0](https://doi.org/10.1038%2Fs41592-019-0619-0). URL:
[http://dx.doi.org/10.1038/s41592-019-0619-0](http://dx.doi.org/10.1038/s41592-019-0619-0).

---

### References

[14] Z. Zhang et al. "Signal recovery in single cell batch integration".  (May. 2023). DOI: [10.1101/2023.05.05.539614](https://doi.org/10.1101%2F2023.05.05.539614). URL:
[http://dx.doi.org/10.1101/2023.05.05.539614](http://dx.doi.org/10.1101/2023.05.05.539614).

[15] V. Nygaard et al. "Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses". In: _Biostatistics_
17.1 (Aug. 2015), p. 29–39. ISSN: 1465-4644. DOI: [10.1093/biostatistics/kxv027](https://doi.org/10.1093%2Fbiostatistics%2Fkxv027). URL:
[http://dx.doi.org/10.1093/biostatistics/kxv027](http://dx.doi.org/10.1093/biostatistics/kxv027).

[16] R. Ma et al. "Principled and interpretable alignability testing and integration of single-cell data". In: _Proceedings of the National Academy of Sciences_ 121.10 (Feb.
2024). ISSN: 1091-6490. DOI: [10.1073/pnas.2313719121](https://doi.org/10.1073%2Fpnas.2313719121). URL:
[http://dx.doi.org/10.1073/pnas.2313719121](http://dx.doi.org/10.1073/pnas.2313719121).

[17] G. Tonkin-Hill. _GitHub - gtonkinhill/TCGA\_analysis - github.com_. <https://github.com/gtonkinhill/TCGA_analysis>. [Accessed 21-06-2024]. 2023.

[18] K. Sankaran et al. "Semisynthetic Simulation for Microbiome Data Analysis".  (Oct. 2024). DOI: [10.1101/2024.10.14.618211](https://doi.org/10.1101%2F2024.10.14.618211).
URL: [http://dx.doi.org/10.1101/2024.10.14.618211](http://dx.doi.org/10.1101/2024.10.14.618211).

[19] B. H. Mecham et al. "Supervised normalization of microarrays". In: _Bioinformatics_ 26.10 (2010), pp. 1308-1315.

[20] J. A. Gagnon-Bartsch et al. "Using control genes to correct for unwanted variation in microarray data". In: _Biostatistics_ 13.3 (Nov. 2011), p. 539–552. ISSN:
1468-4357. DOI: [10.1093/biostatistics/kxr034](https://doi.org/10.1093%2Fbiostatistics%2Fkxr034). URL:
[http://dx.doi.org/10.1093/biostatistics/kxr034](http://dx.doi.org/10.1093/biostatistics/kxr034).

[21] L. Jacob et al. "Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed". In: _Biostatistics_ 17.1 (Aug. 2015), p.
16–28. ISSN: 1468-4357. DOI: [10.1093/biostatistics/kxv026](https://doi.org/10.1093%2Fbiostatistics%2Fkxv026). URL:
[http://dx.doi.org/10.1093/biostatistics/kxv026](http://dx.doi.org/10.1093/biostatistics/kxv026).

[22] D. Gerard et al. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls". In: _Statistica Sinica_ (2021). ISSN: 1017-0405. DOI:
[10.5705/ss.202018.0345](https://doi.org/10.5705%2Fss.202018.0345). URL: [http://dx.doi.org/10.5705/ss.202018.0345](http://dx.doi.org/10.5705/ss.202018.0345).