class: title

# Enhancing Microbiome Analysis with Semisynthetic Data

<style>
.slide-background {
  background: url("figures/cover.png") no-repeat center center;
  background-size: cover;
  opacity: 0.5;
}
</style>

<div id="subtitle_left">
Slides: <a href="https://go.wisc.edu/46xtw9">go.wisc.edu/46xtw9</a><br/>
Paper: <a href="https://go.wisc.edu/p12o8w">go.wisc.edu/p12o8w</a><br/>
Lab: <a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/>
</div>
<div id="subtitle_right">
Kris Sankaran <br/>
<a href="https://2025.mcbios.com/">MCBIOS: Microbiome Symphony</a><br/>
28 | March | 2025 <br/>
</div>

<!-- 13 minute talk -->

---

### Why Simulate?

There are myriad opportunities for using simulation in microbiome analysis [1; 2]. Simulations can help us to...

<img src="figures/noun-benchmark-7569457.png" width=40/> **Benchmark methods** and identify gaps in the literature. <br/>
<img src="figures/noun-labs-99456.png" width=40/> **Design experiments** that have high power to detect subtle signals. <br/>
<img src="figures/noun-checkmark-7518321.png" width=40/> **Check conclusions** that might be sensitive to technical processing steps.

---

### Semisynthetic Data

One of the major advances has been the design of algorithms that can leverage public data resources, like [3; 4; 5; 6].

* **Semisynthetic Data**: The output from a simulator that has been designed to mimic external, template data.
* **Template Data**: Previously gathered experimental data that can be used to train a simulator.

.center[
<img src="figures/template_defn.png" width=670/>
]

---

### Example 1: Multivariate Power Analysis

How would you run a power analysis for Sparse Partial Least Squares Discriminant Analysis (SPLS-DA) [7]?

.pull-left[
SPLS-DA helps with prediction when,

* S: Not all features are predictive
* PLS: Many features are correlated with one another
* DA: The response is one of `\(K\)` classes

Unfortunately, it doesn't come with any analytical power formulas.
]

.pull-right[
<img src="figures/PLS-setup.png" width=500/>
]

---

### Example Output

In this example, we are comparing mice with and without a mouse model of Type I diabetes (T1D). SPLS-DA helps us find taxa that distinguish healthy and disease groups.

.pull-left[
<img src="figures/t1d-true-data.png"/>
]

.pull-right[
<img src="figures/t1d-true-data-factors.png"/>
]

---

### Overall Approach

How many samples are necessary before this method can recover the discriminating factors?

* **Estimate**: Train a simulator on the original data.
* **Alter/Sample**: Define negative control taxa with no association with T1D.
* **Gather/Summarize**: Evaluate SPLS-DA performance on semisynthetic data with varying sample sizes and fractions of negative control taxa.

.center[
<img src="figures/simulation_steps.png" width=700/>
]

---

### Copula Simulation

It is important that our simulator reflect the correlation structure of the underlying data. We applied a Gaussian copula model with a high-dimensional covariance estimator.

.center[
<img src="figures/gene-gene_dependence.png" width=850/>
]

---

### Bivariate Relationships

Here are example bivariate relationships learned by the simulator. Do you see anything off?

.center[
<img src="figures/splsda-correlations.png" width=1200/>
]

---

### Power Curves

These are the results of our simulation experiment across varying sample sizes and proportions of truly associated taxa. When few taxa are truly predictive, many more samples are needed.

.center[
<img src="figures/multivariate_power_isolated.png" width=700/>
]

---

### Example 2: Mediation Analysis

Working with microbiologists and psychologists at UW-Madison, we re-analyzed a dataset about the gut-brain axis.

.pull-left[
1. We re-analyzed a 2021 study [8], which gathered data from 54 participants assigned to either a mindfulness training intervention or a waitlist control (n = 27 each).
1. The training lasted 2 months. Data were collected at the start, finish, and 2-month follow-up.
]

.pull-right[
<img src="figures/design.png" width=450/>
]

---

### Mediation Analysis

1. We wondered whether the mindfulness intervention might affect behavior, which in turn influences microbiota composition.
2. To explore this, we applied a form of mediation analysis to the 16S microbiome and survey data [9; 10].

.center[
<img src="figures/mediation-dag-terms.png" width=550/>
]

---

### Mediation Analysis

1. We wondered whether the mindfulness intervention might affect behavior, which in turn influences microbiota composition.
2. To explore this, we applied a form of mediation analysis to the 16S microbiome and survey data [9; 10].

.center[
<img src="figures/mediation-dag.svg" width=450/>
]

---

### Synthetic Null Data

We can alter the simulator so that some pathways are "turned off." Estimates derived from these data provide a reference null distribution.

.center[
<img src="figures/mindfulness-altered.png" width=780/><br/>
<span style="font-size: 24px;"> The middle panel comes from a synthetic null: `\(T \nrightarrow M \to Y\)`. </span>
]

---

### Synthetic Null Hypothesis Testing

We rank the effects learned from both the real and synthetic null reference data. The significance threshold is chosen to control the empirical false discovery rate, as estimated from the synthetic data.

.center[
<img src="figures/synthetic_null_testing.png" width=900/>
]

---

### Software and Resources

All the examples I discussed today can be run from online tutorials we've written to accompany our papers:

* Simulation for Microbiome Analysis ([go.wisc.edu/wnj5p9](https://go.wisc.edu/wnj5p9))
* Generative Models Examples ([go.wisc.edu/ax73qb](https://go.wisc.edu/ax73qb))

The relevant R packages behind these analyses are:

* `multimedia` - Mediation analysis for microbiome data [11].
* `scDesign3` - An existing simulator for single cell data [12; 13; 14].
* `scDesigner` - Under-development version used in the first tutorial.

---

Simulation turns abstract, conceptual questions into simple empirical ones.
.center[
<img src="figures/simulation_summary.png" width=1000/>
]

---

.center[
### Thank you!
]

* Contact: ksankaran@wisc.edu
* Lab Members: Margaret Thairu, Shuchen Yan, Yuliang Peng, Helena Huang
* Funding: NIGMS R01GM152744, NIAID R01AI184095
* Co-authors: Hanying Jiang, Xinran Miao, Mara Beebe, Dan W. Grupe, Richie Davidson, Jo Handelsman, Saritha Kodikara, Jingyi Jessica Li, Kim-Anh Lê Cao, Susan Holmes

---
class: reference

### References

[1] K. Sankaran et al. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), pp. 325–352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134).

[2] K. Sankaran et al. "Semisynthetic simulation for microbiome data analysis". In: _Brief. Bioinform._ 26.1 (Nov. 2024).

[3] E. Pasolli et al. "Accessible, curated metagenomic data through ExperimentHub". In: _Nature Methods_ 14 (2017), pp. 1023-1024. URL: [https://api.semanticscholar.org/CorpusID:3403081](https://api.semanticscholar.org/CorpusID:3403081).

[4] E. Muller et al. "The gut microbiome-metabolome dataset collection: a curated resource for integrative meta-analysis". In: _npj Biofilms and Microbiomes_ 8.1 (Oct. 2022). ISSN: 2055-5008. DOI: [10.1038/s41522-022-00345-5](https://doi.org/10.1038%2Fs41522-022-00345-5).

[5] F. G. M. Ernst, L. Lahti, and S. Shetty. _microbiomeDataSets_. 2021.

[6] _Home - National Microbiome Data Collaborative - microbiomedata.org_. <https://microbiomedata.org/>.
[Accessed 17-02-2025].

[7] F. Rohart et al. "mixOmics: An R package for 'omics feature selection and multiple data integration". In: _PLoS Comput. Biol._ 13.11 (Nov. 2017), p. e1005752.

[8] D. W. Grupe et al. "The Impact of Mindfulness Training on Police Officer Stress, Mental Health, and Salivary Cortisol Levels". In: _Frontiers in Psychology_ 12 (Sep. 2021). ISSN: 1664-1078. DOI: [10.3389/fpsyg.2021.720753](https://doi.org/10.3389%2Ffpsyg.2021.720753).

[9] K. Imai et al. "A general approach to causal mediation analysis". In: _Psychological Methods_ 15.4 (2010), p. 309.

[10] M. B. Sohn et al. "Compositional mediation analysis for microbiome studies". In: _The Annals of Applied Statistics_ 13.1 (2019), pp. 661-681.

[11] H. Jiang et al. "Multimedia: multimodal mediation analysis of microbiome data". In: _Microbiology Spectrum_ 13.2 (Feb. 2025). Ed. by J. Claesen. ISSN: 2165-0497. DOI: [10.1128/spectrum.01131-24](https://doi.org/10.1128%2Fspectrum.01131-24).

[12] W. V. Li et al. "A statistical simulator scDesign for rational scRNA-seq experimental design". In: _Bioinformatics_ 35.14 (Jul. 2019), pp. i41–i50. ISSN: 1367-4811. DOI: [10.1093/bioinformatics/btz321](https://doi.org/10.1093%2Fbioinformatics%2Fbtz321).

[13] T. Sun et al. "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured". In: _Genome Biology_ 22.1 (May 2021). ISSN: 1474-760X. DOI: [10.1186/s13059-021-02367-2](https://doi.org/10.1186%2Fs13059-021-02367-2).

---
class: reference

### References

[14] D. Song et al.
"scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". In: _Nature Biotechnology_ 42.2 (May. 2023), p. 247–252. ISSN: 1546-1696. DOI: [10.1038/s41587-023-01772-1](https://doi.org/10.1038%2Fs41587-023-01772-1). URL: [http://dx.doi.org/10.1038/s41587-023-01772-1](http://dx.doi.org/10.1038/s41587-023-01772-1). --- ### Evaluation Taxonomy This is how some common techniques fall into this taxonomy. * **Graphical, Narrow**: Boxplots or cumulative distribution function plots comparing real vs. simulated taxa (like the DA example). * **Graphical, Broad**: Principal component plots of real vs. simulated dataset. * **Quantitative, Narrow**: Two-sample Kolmogorov-Smirnov test. * **Quantitative, Broad**: Evaluation through classification (next example). * **Fit-for-Purpose**: Linear model coefficients on real vs. simulated data ([2], "Batch Effect Correction"). --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step1.png" width=500/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step2.png" width=240/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step3.png" width=400/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. 
.center[
<img src="figures/PLS-step4.png" width=150/>
]

Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`.

---

### SPLS-DA Intuition

Now we can compare samples from the two tables in a single, shared space.

.center[
<img src="figures/PLS-step5.png" width=800/>
]

---

### SPLS-DA Intuition

Now we can compare samples from the two tables in a single, shared space.

.center[
<img src="figures/PLS-step6.png" width=800/>
]

---

### SPLS-DA Intuition

To get more than one dimension, we can repeat this process after removing any correlation with previously found patterns.

.center[
<img src="figures/PLS-step7.png" width=800/>
]

---

### Copula Models

More formally, let `\(F_{1}, \dots, F_{D}\)` be the target margins and let `\(\Phi\)` be the CDF of the standard Gaussian distribution. Gaussian copula modeling has these steps.

Estimate:

1. Gaussianize the observed `\(\mathbf{x}_{i}\)` to `\(\mathbf{z}_{i} := \left[\Phi^{-1}\left(F_{1}\left(x_{i1}\right)\right), \dots, \Phi^{-1}\left(F_{D}\left(x_{iD}\right)\right)\right]\)`
1. Estimate the covariance `\(\hat{\Sigma}\)` associated with the `\(\mathbf{z}_{i}\)`

Simulate:

1. Draw `\(\mathbf{z}^\ast \sim \mathcal{N}\left(0, \hat{\Sigma}\right)\)`
1. Transform back `\(\mathbf{x}^{\ast} := \left[F_{1}^{-1}\left(\Phi\left(z_{1}^\ast\right)\right), \dots, F_{D}^{-1}\left(\Phi\left(z_{D}^\ast\right)\right)\right]\)`

---

### Real vs. Simulated Correlation

.center[
<img src="figures/correlation_histogram.png" width=700/>
]

A detailed explanation is given [here](https://krisrs1128.github.io/microbiome-simulation/multivariate-power-analysis.html#evaluation).

---

### Tuning High-Dimensional Covariance Estimator

.center[
<img src="figures/covariance_hyperparameter_errors.png" width=700/>
]

A detailed explanation is given [here](https://krisrs1128.github.io/microbiome-simulation/multivariate-power-analysis.html#evaluation).
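---

### Copula Sketch

The estimate and simulate steps from the Copula Models slide can be sketched for two features as below. This is a minimal illustration, not the simulator used in the talk: it assumes empirical margins and a plain Pearson correlation in place of the high-dimensional covariance estimator, and the helper names (`gaussianize`, `pearson`, `empirical_quantile`, `simulate_pair`) and toy counts are made up for this example.

```python
import random
from bisect import bisect_right
from statistics import NormalDist

PHI = NormalDist()  # standard Gaussian: PHI.cdf is Phi, PHI.inv_cdf is its inverse


def gaussianize(column):
    """Map one feature to Phi^{-1}(F_hat(x)) using a rescaled empirical CDF."""
    ordered = sorted(column)
    n = len(ordered)
    # Dividing by n + 1 keeps F_hat strictly inside (0, 1), so inv_cdf stays finite.
    return [PHI.inv_cdf(bisect_right(ordered, x) / (n + 1)) for x in column]


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def empirical_quantile(ordered, p):
    """Simple inverse empirical CDF: the order statistic at position floor(p * n)."""
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


def simulate_pair(x1, x2, n_draws, seed=0):
    """Fit a two-feature Gaussian copula to (x1, x2) and draw synthetic pairs."""
    rng = random.Random(seed)
    # Estimate: correlation of the Gaussianized features (assumption: no shrinkage).
    rho = pearson(gaussianize(x1), gaussianize(x2))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    s1, s2 = sorted(x1), sorted(x2)
    draws = []
    for _ in range(n_draws):
        # Simulate: correlated Gaussian draw via the 2 x 2 Cholesky factor.
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = e1
        z2 = rho * e1 + (1 - rho ** 2) ** 0.5 * e2
        # Transform back through the empirical margins.
        draws.append((empirical_quantile(s1, PHI.cdf(z1)),
                      empirical_quantile(s2, PHI.cdf(z2))))
    return draws


# Toy template data (made up for illustration): two positively associated taxa.
taxon_a = [0, 2, 3, 5, 8, 13, 21, 34]
taxon_b = [1, 6, 4, 7, 22, 15, 30, 41]
synthetic = simulate_pair(taxon_a, taxon_b, n_draws=200)
```

Because the back-transformation uses empirical quantiles, every simulated value is one of the observed values. Generating unseen counts would require a parametric margin.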
---

### Intuition

* In the Gaussianized space, it's easy to model correlation.
* The mapping back and forth is possible because we know the margins `\(F\)`.
  - `\(\Phi\)` represents the Gaussian CDF applied componentwise

<br/>
<br/>

.center[
<img src="figures/copula_transformation.png" width=700/>
]

---

### Pilot Study

.pull-left[
1. We re-analyzed a pilot study from 2021 [8], which gathered data from 54 participants randomly assigned to either a mindfulness training intervention or a waitlist control (n = 27 each).
1. The training lasted 2 months. Data were collected at the start, finish, and 2-month follow-up.
]

.pull-right[
<img src="figures/design.png" width=450/>
]

---

### Estimated Indirect Effects

These figures summarize the paths `\(T \to M \to Y\)`.<br/>
(i.e., color `\(\to\)` x-axis `\(\to\)` y-axis).

.center[
<img src="figures/mindfulness-indirect-effects.png" width=900/>
]

---

### Figure Sources

frustration by Rikas Dzihab from <a href="https://thenounproject.com/browse/icons/term/frustration/" target="_blank" title="frustration Icons">Noun Project</a> (CC BY 3.0)

confused by Rikas Dzihab from <a href="https://thenounproject.com/browse/icons/term/confused/" target="_blank" title="confused Icons">Noun Project</a> (CC BY 3.0)

Benchmark by Sofiah from <a href="https://thenounproject.com/browse/icons/term/benchmark/" target="_blank" title="Benchmark Icons">Noun Project</a> (CC BY 3.0)

checkmark by Asa Kharisma Dini from <a href="https://thenounproject.com/browse/icons/term/checkmark/" target="_blank" title="checkmark Icons">Noun Project</a> (CC BY 3.0)

Lab glassware by Vectors Market from <a href="https://thenounproject.com/browse/icons/term/lab-glassware/" target="_blank" title="Lab glassware Icons">Noun Project</a> (CC BY 3.0)
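---

### Synthetic Null Threshold Sketch

The thresholding idea from the Synthetic Null Hypothesis Testing slide can be sketched as follows. This is one plausible reading, not necessarily the procedure implemented in `multimedia`. It assumes one effect estimate per feature from the real data and one from the synthetic null data (equal numbers of each), estimates the false discovery proportion at a cutoff as the ratio of null to real effects at or above it, and keeps the most permissive cutoff whose estimate stays at or below a target `q`. The function name `fdr_threshold` and the toy effect values are hypothetical.

```python
def fdr_threshold(real_effects, null_effects, q=0.1):
    """Smallest |effect| cutoff whose estimated false discovery proportion is <= q.

    Returns None when no cutoff meets the target.
    """
    real = sorted((abs(e) for e in real_effects), reverse=True)
    null = [abs(e) for e in null_effects]
    best = None
    # Candidate cutoffs are the observed real magnitudes, from strictest to loosest.
    for cutoff in real:
        n_real = sum(1 for r in real if r >= cutoff)  # discoveries at this cutoff
        n_null = sum(1 for v in null if v >= cutoff)  # synthetic null "discoveries"
        if n_null / n_real <= q:
            best = cutoff  # keep updating, so the loosest valid cutoff wins
    return best


# Toy effect estimates (made up for illustration).
real_effects = [5.0, 4.2, 3.9, 0.8, -0.5, 0.3]
null_effects = [1.0, -0.7, 0.4, 0.2, -0.9, 0.1]
cutoff = fdr_threshold(real_effects, null_effects, q=0.2)
selected = [e for e in real_effects if cutoff is not None and abs(e) >= cutoff]
```

In this toy example the three largest real effects exceed every synthetic null effect, so they are the only ones selected. A `None` result means that no cutoff reaches the target.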