class: title <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { myred: ["{\\color{myred}{#1}}", 1], mygreen: ["{\\color{mygreen}{#1}}", 1], reals: "{\\mathbb{R}}", "\*": ["{\\mathbf{#1}}", 1], diag: ["{\\text{diag}\\left({#1}\\right)}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .myred {color: #B4575C;} .mygreen {color: #5A8A80;} </style> ## Managing Batch Effects in Biological Data <div id="subtitle"> Kris Sankaran <br/> 23 | October | 2024 <br/> Lab: <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/> </div> <div id="subtitle_right"> BME 780: Methods in Quantitative Biology<br/> Slides: <a href="https://go.wisc.edu/1lv9js">go.wisc.edu/1lv9js</a><br/> </div> <!-- 50 minute talk --> --- ### Learning Outcomes 1. **Imagine**: Why might batch effects appear in your data? 1. **Derive**: Interpret algebraic expressions used to represent batch effect correction methods. 1. **Check**: Reason about checks that can give more confidence in a batch effect correction method's output. 1. **Improve**: Explain the potential for replication and controls to improve batch effect correction methods --- .center[ ## Motivation ] --- ### Cancer Microbiome Controversy [1] .center[ <img src="figures/retraction.png"/> ] In June 2024, Nature retracted a paper [2] the claimed identify microbiome signatures of cancer. This came after one year's worth of debate [3; 4; 5] about the data analysis. --- ### Cancer Microbiome Controversy [1] .center[ <img src="figures/retraction.png"/> ] The "disease signature" was an artifact resulting from the use of a batch effect correction method. Before we can understand the nuances of the story, we need to learn about batch effects and correction methods. --- ### Sources of Technical Variation There are many points at which technical variation arises. For example, for single-cell sequencing [6]: .pull-left[ * Sample Collection * Sample Storage * Library Preparation * cDNA Synthesis ] .pull-right[ * PCR Amplification * Sequencing * Read Assignment ] Differences at any step can lead to systematic differences between data collected across experimental runs ("batches"). This is especially problematic for studies with large sample sizes or multiple study sites. --- ### Measurement vs. Reality [7] .pull-left[ 1. In the same way that a photo can be blurry or obscured by glare, the measurements we obtain from sequencing can be obscured by technical factors. 1. Any data are only an imperfect snapshot of the biological scene we're seeking to understand. ] .pull-right[ <img src="figures/kensington.jpg" width=320/> <p class="caption"> The glare is a technical artifact of the measuring device. </p> ] --- ### Measurement vs. Reality [7] .pull-left[ 1. In the same way that a photo can be blurry or obscured by glare, the measurements we obtain from sequencing can be obscured by technical factors. 1. Any data are only an imperfect snapshot of the biological scene we're seeking to understand. ] .pull-right[ <img src="figures/goodsell_phagophore.jpg" width=300/> <p class="caption"> It is hard to precisely measure molecular systems. Artwork from [8]. </p> ] --- ### Improving Discovery .pull-left[ 1. Batch effects can mask real biological effects. 1. To discover more interesting patterns, we need to either remove these effects or account for them in the analysis. ] .pull-right[ <img src="figures/batch_effects_leek.png" width=700/> <p class="caption"> Read coverage across samples from the HapMap project, discussed in [9]. Horizontal blocks have the same sequencing date. </p> ] --- ### Improving Discovery .pull-left[ 1. Batch effects can mask real biological effects. 1. To discover more interesting patterns, we need to either remove these effects or account for them in the analysis. ] .pull-right[ <img src="figures/batch_effects_eva.png" width=700/> <p class="caption"> Batch effects across collection dates from a microbiome study, discussed in [10]. </p> ] --- ### Improving Credibility > If all goes well, then the article... has shown mammal kidney structure; if all goes badly, it shrinks to three hamsters in one laboratory in 1984. > > -- Bruno Latour in [11] To dispute a scientific fact, a dissenter can always argue that the conditions that led to its production are not generalizable. To guard against this, we need to understand batch-to-batch variation. --- .center[ ## Methods ] --- ### General Strategies 1. Correct: Subtract the batch effects before all other analysis. 1. Account: Directly model batch-to-batch variation in the analysis. | | Pros | Cons | |--|---|---| | Correct | Generality. Can be reused easily. | Can introduce its own artifacts. | | Account | Cohesion. Only one method needed for everything. | Time consuming. Bespoke development for each context. | We will review correction methods. --- ### Correction Methods .pull-left[ Any method must define, 1. Transformations: How might batch effects impact measurement? 1. Objective: How can we find the most plausible transformation for our data? ] .pull-right[ <img src="figures/batch_transformation.png"/> ] The best choice can depend on properties of the data and the experimental design, which is why there are so many available methods. --- ### Exemplar Methods Today and next week, we'll consider: 1. ComBat [12] (2007, 7.6K citations). Linear Model. 1. Harmony [13] (2019, 4.8K citations). Clustering + Weighted Least Squares. 1. CellAnova [14] (preprint, 2023). Clustering + Linear Model. --- ### ComBat - Setting ComBat was proposed to remove batch effects in microarray data with small sample sizes, `\(n \approx 5 - 20\)` per batch, and it has inspired many variants. 1. `\(\*y_i \in \reals^{G}\)`: Sample `\(i\)`'s expression levels across `\(G\)` genes. 1. `\(\*x_i \in \reals^{D}\)`: Potentially relevant biological factors. 1. `\(m\left(i\right) \in \left[1, \dots, M \right]\)`: Which of the `\(M\)` batches does sample `\(i\)` belong to? --- ### ComBat - Model Sample `\(i\)` is assumed to be drawn from: `\begin{align*} \*y_{i} = \mu + \*W \*x_{i} + \*b_{m(i)} + \Lambda_{m(i)}\*\epsilon_i \end{align*}` where `\(\Lambda_{m} = \diag{\delta_{mg}}\)` rescales the noise `\(\epsilon_i \sim \mathcal{N}\left(0, \diag{\sigma_g^2}\right)\)`. 1. The parameters `\(\*W\)` capture shared biological effects. 1. Batch effects cause systematic location-scale shifts. 1. Given `\(\*b_m\)` and `\(\*\Lambda_{m}\)`, we can remove their effect from the observed data. --- ### ComBat - Estimation Rather than estimating each gene `\(g\)` separately, ComBat uses a hierarchical model to share information across all genes. This improves stability in the corrections for genes with very high noise levels. .center[ <img src="figures/combat_fig3.gif" width=330/> ] <p class="caption"> By fitting the model simultaneously across genes, the estimates are "shrunk" towards the mean. </p> --- ### Harmony - Setting Harmony was proposed to integrate single-cell data across experiments. Since these data have much larger sample sizes, it's possible to learn a more complex transformation. .center[ <img src="figures/harmony_integration.png" width=840/> ] <p class="caption"> First cluster to find cell types and then make batches overlap within each cluster. </p> --- ### Harmony - Clustering .pull-left[ Soft clustering finds centroids `\(\*\mu_k\)` and responsibilities `\(\*r_i\)` to solve, `\begin{align*} \min_{\*\mu_k \in \reals^{G}, \*r_i \in \Delta^{K}} \sum_{i = 1}^{N} \left[\sum_{k = 1}^{K} r_{ik} \|\*y_i - \*\mu_{k}\|_{2}^{2} - \lambda H\left(\*r_{i}\right)\right] \end{align*}` where `\(H\left(\*r_i\right)\)` is the entropy of sample `\(i\)`'s responsibilities. ] .pull-right[ <img src="figures/soft_clustering.png" width=350 align="right"/> ] --- ### Harmony - Clustering They use a chi-square statistic to encourage clusters to contain an evan mix across batches (strong associations lead to large values of the statistic). `\begin{align*} \min_{\*\mu_k \in \reals^{G}, \*r_i \in \Delta^{K}} \sum_{i = 1}^{N} &\left[\sum_{k = 1}^{K}r_{ik}\|\*y_i - \*\mu_{k}\|_{2}^{2} - \lambda H\left(\*r_{i}\right)\right] +\chi^2\left(\text{batch}, \text{cluster}\right) \end{align*}` .center[ <img src="figures/harmony_integration-step2.png" width=800/> ] --- ### Harmony - Correction They use weighted least squares to estimate batch means for each cluster, `\begin{align*} \sum_{k} r_{ik} \|\*y_i - \mu_k - \*b^k_{m\left(i\right)}\|_{2}^2 \end{align*}` The corrected data are `\(\tilde{\*y}_i = \sum_{k}r_{ik}\mu_k\)`. .center[ <img src="figures/harmony_integration-step3.png" width=800/> ] --- ### Challenges Batch effect correction can introduce subtle problems of its own. **Hypothesis Testing**: When the batches are highly imbalanced, correction can distort the null distribution of test statistics. This invalidates the usual `\(p\)`-values [15]. .center[ <img src="figures/nygaard_pvals.png" width=880/> ] --- ### Challenges Batch effect correction can introduce subtle problems of its own. .pull-three-quarters-left[ **Alignability**: Batch effect correction methods will attempt to match batches regardless of whether there is actually any shared biology [16]. ] .pull-three-quarters-right[ <img src="figures/alignability.jpg" width=200 align="right"/> ] --- ### Challenges Batch effect correction can introduce subtle problems of its own. **Supervision**: Methods that use a biological group label can introduce spurious associations with that label [17]. .center[ <img src="figures/retraction.png" width = 600/> ] --- .center[ ## Disciplined Design and Analysis ] --- ### Sensitivity and Simulation 1. Run the analysis separately across batches. Many of the results should overlap if the biology is truly shared across batches. 1. Use a simulation to better understand the pipeline. There are many packages for learning a simulator based on an initial dataset [18]. --- ### Simulation and Supervised Normalization Gerry Tonkin-Hill has an excellent re-analysis [17] of the data from [2] which sheds light what were likely the source of the phantom signals. The first part is a simulation. .pull-left[ <img src="figures/gth_sim1.png"/> ] .pull-right[ <img src="figures/gth_sim2.png"/> ] --- ### Simulation and Supervised Normalization This setting is balanced -- each condition is equally likely across batches. In this case, SVN batch effect correction [19] works well. .pull-left[ <img src="figures/gth_sim3.png"/> ] .pull-right[ <img src="figures/gth_sim4.png"/> ] --- ### Simulation and Supervised Normalization But what happens if there is imbalance? .pull-left[ <img src="figures/gth_sim5.png"/> ] .pull-right[ <img src="figures/gth_sim6.png"/> ] --- ### Simulation and Supervised Normalization In this case, the SVN correction introduces an artificial difference. .center[ <img src="figures/gth_sim7.png" width=500/> ] This should cause concern about the original analysis: Hospitals specialize in cancer types. Then again, this simulation is quite unrealistic. --- ### Experimental Design 1. To better account for batch effects, we should think carefully about replication and controls [20; 21; 22]. - Replication: Generate multiple measurements of the same sample. - Controls: Samples or features where the signal is known to be absent. 1. Any variation that is only present after a step of data generation must be due to the measurement mechanism, rather than the true biology. --- ### TCGA Data Revisited 1. [2]'s data had been sequenced at several centers, and they used SVN to remove center-level batch effects. 1. After [3] raised concerns about the paper, they decided to rerun the analysis restricted to data sequenced at Harvard Medical. .center[ <img src="figures/gth_real1.png" width=480/> ] --- ### TCGA Data Revisited However, even though the data were sequenced at the same center, they were collected by different hospitals. .center[ <img src="figures/gth_real2.png" width=600/> ] --- ### TCGA Data Revisited If you remove the hospital-level batch effect, the signal more or less vanishes. Either applying or omitting a batch effect correction method can lead to spurious effects! It is important to be skeptical and carefully consider the experimental design. .center[ <img src="figures/gth_real4.png" width=600/> ] --- ### CellAnova 1. CellAnova [14] takes this thinking a step further by formalizing an approach to defining and using control samples. 1. Across all batches, control samples are assumed to have the same underlying biology, and this gives confidence in the resulting correction. --- ### CellAnova The CellAnova paper uses notation that stacks the expressions like those we saw in ComBat. `\begin{align*} \*y_{i} = \mu + \*W \*x_{i} + \*b_{m(i)} + \Lambda_{m(i)}\*\epsilon_i \end{align*}` vs. `\begin{align*} \*Y = \*M + \*X \*W^\top + \*B \*V^\top \end{align*}` --- ### Conclusion 1. **Garbage In, Garbage Out**: If the main source of variation in the data comes from batch effects, then even the most sophisticated methodology will fail to discover interesting biology. 1. **Approaches**: Correction methods estimate batch-specific transformations of shared biology. These transformations are then reversed to define the correction. 1. **Checks**: It's important not to use batch effect correction methods thoughtlessly. Consider applying simulation and leveraging the experimental design. --- class: reference ### References [1] C. Offord. "Journal retracts influential cancer microbiome paper". In: _AAAS Articles DO Group_ (Jun. 2024). DOI: [10.1126/science.z2xkuua](https://doi.org/10.1126%2Fscience.z2xkuua). URL: [http://dx.doi.org/10.1126/science.z2xkuua](http://dx.doi.org/10.1126/science.z2xkuua). [2] G. D. Poore et al. "RETRACTED ARTICLE: Microbiome analyses of blood and tissues suggest cancer diagnostic approach". In: _Nature_ 579.7800 (Mar. 2020), p. 567–574. ISSN: 1476-4687. DOI: [10.1038/s41586-020-2095-1](https://doi.org/10.1038%2Fs41586-020-2095-1). URL: [http://dx.doi.org/10.1038/s41586-020-2095-1](http://dx.doi.org/10.1038/s41586-020-2095-1). [3] A. Gihawi et al. "Major data analysis errors invalidate cancer microbiome findings". In: _mBio_ 14.5 (Oct. 2023). Ed. by I. B. Zhulin. ISSN: 2150-7511. DOI: [10.1128/mbio.01607-23](https://doi.org/10.1128%2Fmbio.01607-23). URL: [http://dx.doi.org/10.1128/mbio.01607-23](http://dx.doi.org/10.1128/mbio.01607-23). [4] G. D. Sepich-Poore et al. "Reply to: Caution Regarding the Specificities of Pan-Cancer Microbial Structure". (Feb. 2023). DOI: [10.1101/2023.02.10.528049](https://doi.org/10.1101%2F2023.02.10.528049). URL: [http://dx.doi.org/10.1101/2023.02.10.528049](http://dx.doi.org/10.1101/2023.02.10.528049). [5] G. D. Sepich-Poore et al. "Robustness of cancer microbiome signals over a broad range of methodological variation". In: _Oncogene_ 43.15 (Feb. 2024), p. 1127–1148. ISSN: 1476-5594. DOI: [10.1038/s41388-024-02974-w](https://doi.org/10.1038%2Fs41388-024-02974-w). URL: [http://dx.doi.org/10.1038/s41388-024-02974-w](http://dx.doi.org/10.1038/s41388-024-02974-w). [6] R. Jiang et al. "Statistics or biology: the zero-inflation controversy about scRNA-seq data". In: _Genome biology_ 23.1 (2022), p. 31. [7] A. Sarkar et al. "Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis". In: _Nature genetics_ 53.6 (2021), pp. 770-777. [8] D. S. Goodsell et al. "From Atoms to Cells: Using Mesoscale Landscapes to Construct Visual Narratives". In: _Journal of Molecular Biology_ 430.21 (Oct. 2018), p. 3954–3968. ISSN: 0022-2836. DOI: [10.1016/j.jmb.2018.06.009](https://doi.org/10.1016%2Fj.jmb.2018.06.009). URL: [http://dx.doi.org/10.1016/j.jmb.2018.06.009](http://dx.doi.org/10.1016/j.jmb.2018.06.009). [9] J. T. Leek et al. "Tackling the widespread and critical impact of batch effects in high-throughput data". In: _Nature Reviews Genetics_ 11.10 (Sep. 2010), p. 733–739. ISSN: 1471-0064. DOI: [10.1038/nrg2825](https://doi.org/10.1038%2Fnrg2825). URL: [http://dx.doi.org/10.1038/nrg2825](http://dx.doi.org/10.1038/nrg2825). [10] Y. Wang et al. "Managing batch effects in microbiome data". In: _Briefings in Bioinformatics_ 21.6 (Nov. 2019), p. 1954–1970. ISSN: 1477-4054. DOI: [10.1093/bib/bbz105](https://doi.org/10.1093%2Fbib%2Fbbz105). URL: [http://dx.doi.org/10.1093/bib/bbz105](http://dx.doi.org/10.1093/bib/bbz105). [11] B. Latour. "Science in action: How to follow scientists and engineers through society". In: _Harvard UP_ (1987). [12] W. E. Johnson et al. "Adjusting batch effects in microarray expression data using empirical Bayes methods". In: _Biostatistics_ 8.1 (Apr. 2006), p. 118–127. ISSN: 1465-4644. DOI: [10.1093/biostatistics/kxj037](https://doi.org/10.1093%2Fbiostatistics%2Fkxj037). URL: [http://dx.doi.org/10.1093/biostatistics/kxj037](http://dx.doi.org/10.1093/biostatistics/kxj037). [13] I. Korsunsky et al. "Fast, sensitive and accurate integration of single-cell data with Harmony". In: _Nature Methods_ 16.12 (Nov. 2019), p. 1289–1296. ISSN: 1548-7105. DOI: [10.1038/s41592-019-0619-0](https://doi.org/10.1038%2Fs41592-019-0619-0). URL: [http://dx.doi.org/10.1038/s41592-019-0619-0](http://dx.doi.org/10.1038/s41592-019-0619-0). --- class: reference ### References [14] Z. Zhang et al. "Signal recovery in single cell batch integration". (May. 2023). DOI: [10.1101/2023.05.05.539614](https://doi.org/10.1101%2F2023.05.05.539614). URL: [http://dx.doi.org/10.1101/2023.05.05.539614](http://dx.doi.org/10.1101/2023.05.05.539614). [15] V. Nygaard et al. "Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses". In: _Biostatistics_ 17.1 (Aug. 2015), p. 29–39. ISSN: 1465-4644. DOI: [10.1093/biostatistics/kxv027](https://doi.org/10.1093%2Fbiostatistics%2Fkxv027). URL: [http://dx.doi.org/10.1093/biostatistics/kxv027](http://dx.doi.org/10.1093/biostatistics/kxv027). [16] R. Ma et al. "Principled and interpretable alignability testing and integration of single-cell data". In: _Proceedings of the National Academy of Sciences_ 121.10 (Feb. 2024). ISSN: 1091-6490. DOI: [10.1073/pnas.2313719121](https://doi.org/10.1073%2Fpnas.2313719121). URL: [http://dx.doi.org/10.1073/pnas.2313719121](http://dx.doi.org/10.1073/pnas.2313719121). [17] G. Tonkin-Hill. _GitHub - gtonkinhill/TCGA\_analysis - github.com_. <https://github.com/gtonkinhill/TCGA_analysis>. [Accessed 21-06-2024]. 2023. [18] K. Sankaran et al. "Semisynthetic Simulation for Microbiome Data Analysis". (Oct. 2024). DOI: [10.1101/2024.10.14.618211](https://doi.org/10.1101%2F2024.10.14.618211). URL: [http://dx.doi.org/10.1101/2024.10.14.618211](http://dx.doi.org/10.1101/2024.10.14.618211). [19] B. H. Mecham et al. "Supervised normalization of microarrays". In: _Bioinformatics_ 26.10 (2010), pp. 1308-1315. [20] J. A. Gagnon-Bartsch et al. "Using control genes to correct for unwanted variation in microarray data". In: _Biostatistics_ 13.3 (Nov. 2011), p. 539–552. ISSN: 1468-4357. DOI: [10.1093/biostatistics/kxr034](https://doi.org/10.1093%2Fbiostatistics%2Fkxr034). URL: [http://dx.doi.org/10.1093/biostatistics/kxr034](http://dx.doi.org/10.1093/biostatistics/kxr034). [21] L. Jacob et al. "Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed". In: _Biostatistics_ 17.1 (Aug. 2015), p. 16–28. ISSN: 1468-4357. DOI: [10.1093/biostatistics/kxv026](https://doi.org/10.1093%2Fbiostatistics%2Fkxv026). URL: [http://dx.doi.org/10.1093/biostatistics/kxv026](http://dx.doi.org/10.1093/biostatistics/kxv026). [22] D. Gerard et al. "Unifying and Generalizing Methods for Removing Unwanted Variation Based on Negative Controls". In: _Statistica Sinica_ (2021). ISSN: 1017-0405. DOI: [10.5705/ss.202018.0345](https://doi.org/10.5705%2Fss.202018.0345). URL: [http://dx.doi.org/10.5705/ss.202018.0345](http://dx.doi.org/10.5705/ss.202018.0345).