class: title # Enhancing Microbiome Analysis with Semisynthetic Data <style> .slide-background { background: url("figures/cover.png") no-repeat center center; background-size: cover; opacity: 0.5; } </style> <div id="subtitle"> Kris Sankaran <br/> Plant Pathology Seminar<br/> 18 | February | 2025 <br/> </div> <div id="subtitle_right"> Slides: <a href="https://go.wisc.edu/689h7c">go.wisc.edu/689h7c</a><br/> Lab: <a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/> </div> <!-- 55 minute talk --> --- ### Microbiome Data 1. A microbiome is a microbe-scale ecosystem. It can be described by taxonomic composition, genomic function, and biochemical features. 1. Advances in sequencing technology have made it easier than ever to rapidly profile these taxonomic and genomic features in a range of sites, including in the human gut, on plant roots, and in the oceans. .center[ <span style="font-size: 18px"> <img src="figures/spatial_earle.png" width=455/><br/> Spatial imaging of a microbial community along the gut lining, from [1]. </span> ] --- ### Statistical Challenges Developing the data analysis for a microbiome study can be complicated by a number of factors. * **Integration**: How should we transform and analyze data across batches or technologies, each with unique sources of technical variability? * **Experimental Design**: How should we arrange sampling, assign treatments, and place controls so that we can have powerful statistical conclusions? * **Reproducibility**: How can we be sure our conclusions are trustworthy? --- ### Data Analysis Controversy .center[ <img src="figures/retraction.png"/> ] In June 2024, _Nature_ retracted a paper [2] the claimed identify microbiome signatures of cancer. This came after one year's worth of debate [3; 4] about the data analysis. --- ### Data Analysis Controversy .center[ <img src="figures/retraction.png"/> ] The "disease signature" was an artifact resulting from the use of a batch effect correction method. Before we can understand the nuances of the story, we need to learn about batch effects and correction methods. --- ### Simulation to Resolve Controversy Gerry Tonkin-Hill has an excellent re-analysis [5] of the data from [2] which sheds light what were likely the source of the phantom signals. The first part is a simulation. .pull-left[ <img src="figures/gth_sim1.png"/> ] .pull-right[ <img src="figures/gth_sim2.png"/> ] --- ### The Changing Simulation Landscape Historically, microbiome researchers have only rarely used simulation in their data analysis workflow. .pull-left[ <img src="figures/noun-frustration-7442748.png" width=70/> The simulators would have to be written from scratch, which requires significant effort. ] .pull-right[ <img src="figures/noun-confused-7442754.png" width=70/> Even afterwards, the resulting data may not be realistic enough to use to guide any practical conclusions. ] I know this firsthand from writing my PhD thesis [6]... but the situation seems to be improving quite rapidly!! --- ### Semisynthetic Data One of the major advances has been the design of algorithms that can leverage public data resources, like [7; 8; 9; 10]. * **Semisynthetic Data**: The output from a simulator that has been designed to mimic external, template data. * **Template Data**: Previously gathered experimental data that can be used to train a simulator. .center[ <img src="figures/template_defn.png" width=670/> ] --- ### New Packages We also have many more packages that implement these new methods. Here are 6 out of the 11 packages discussed in our review [11]. .center[ <img src="figures/simulation_packages_table.png" width=850/> ] --- ### Talk Outline This talk gives examples of how semisynthetic data can help microbiome data analysis. It is based on references [12; 13; 11]. <hr/> <div class="outline-container"> <div class="act-header"> <span class="tilde"></span> Act I: Benchmarking and Power Analysis <span class="tilde"></span> </div> <div class="sub-item">Differential Abundance</div> <div class="sub-item">Dimensionality Reduction</div> <div class="interlude"> <span class="tilde"></span> Interlude on Evaluation <span class="tilde"></span> </div> <div class="act-header"> <span class="tilde"></span> Act II: Reliability and Mediation <span class="tilde"></span> </div> <div class="sub-item">Data Integration</div> <div class="sub-item">Mediation Analysis</div> </div> <hr/> --- ## Act I: Benchmarking and Power Analysis --- ### Differential Abundance A common question in microbiome analysis is whether a given taxon is more vs. less abundant in some conditions vs. others. Formally, consider * Hypotheses of interest: `\(H_{1}, \dots, H_{M}\)`. Some of them are non-null, but you don't know which. * Associated `\(p\)`-values: `\(p_{1}, \dots, p_{M}\)`. Goal: Reject as many non-null hypotheses as possible while controlling the _False Discovery Rate_ [14; 15], `\begin{align*} \text{FDR} := \mathbf{E}\left[\frac{|\text{False Positives}|}{|\text{Rejections}| \vee 1}\right] \end{align*}` --- ### Simulation Example We can define benchmark using our data. In this example, we 1. Train the simulator to mimic 130 genera from the study [16], allowing means and variances to depend on BMI group. 1. Define _computational negative controls_ by removing effects from the genera with the weakest effects (insigificant at a cutoff `\(q = 0.1\)`). .center[ <img src="figures/bmi_effect_cartoon.png" width=910/> ] --- ### Example Simulated Data The semisynthetic data seems to capture group and taxa differences among these highly abundant taxa. .center[ <img src="figures/contrast-simulators-1.png" width=1000/> ] --- ### Benchmarking Analysis .pull-three-quarters-left[ <img src="figures/visualize-da-power-1.png" width=700/> ] .pull-three-quarters-right[ All methods control the FDR. LIMMA has high variability in performance. Power only plateaus around `\(n = 1000\)` samples. ] --- ### Implementation For this simulator, we used a zero-inflated Negative Binomial variant of the scDesign3 model [17]. For the abundance of taxon `\(j\)` in sample `\(i\)`, we used: `\begin{align*} X_{ij} \sim \text{ZINB}\left(\mu_{g\left(i\right)j}, \varphi_{g\left(i\right)j}, \nu_{j}\right) \end{align*}` where `\(g\left(i\right)\)` is the BMI category of sample `\(i\)` and where `\(\mu, \varphi\)`, and `\(\nu\)` are mean, dispersion, and zero-inflation parameters, respectively. .center[ <img src="figures/zinb_cartoon.png" width=350/> ] --- ### Community-wide Associations In many problems, we are interested in the relationships across a collection of taxa. These analysis require more advanced methods, like network [18; 19] or dimensionality reduction [20] (see figure below) techniques. .center[ <img src="figures/antibiotic_prototypes.png" width=840/><br/> <span style="font-size: 18px;"> </span> ] --- ### Motivation: Power Analysis .pull-left[ 1. Power analyses are intended to prevent researchers from embarking on studies that have very little chance of detecting the hypothesized signals. 1. While there are formulas for certain univariate tests, there aren't any for more complex, multivariate models. ] .pull-right[ <img src="figures/lenth_power_calculator.png" width=500/> <span style="font-size: 18px;"> Calculator from Russ Lenth's power and sample size webpage [21]. </span> ] --- ### SPLS-DA Setting Our power analysis uses Sparse Partial Least Squares Discriminant Analysis (SPLS-DA) [22]. This topic could be its own full workshop, but let's review the core ideas. .pull-left[ SPLS-DA helps with prediction when, * S: Not all features are predictive * PLS: Many features are correlated with one another * DA: The response is one of `\(K\)` classes ] .pull-right[ <img src="figures/PLS-setup.png" width=500/> ] --- ### Example Output In this example, we are comparing mice with and without a mouse model of Type I diabetes (T1D). SPLS-DA helps us find taxa that distinguish healthy and disease groups. .pull-left[ <img src="figures/t1d-true-data.png"/> ] .pull-right[ <img src="figures/t1d-true-data-factors.png"/> ] --- ### Problem Formulation How many samples are necessary before this method can recover the discriminating factors? * **Estimate**: Train a simulator on the original data. * **Alter/Sample**: Define negative control taxa with no association with T1D. * **Gather/Summarize**: Evaluate SPLS-DA performance on semisynthetic data with varying sample sizes and fractions of negative control taxa. .center[ <img src="figures/simulation_steps.png" width=700/> ] --- ### Bivariate Relationships Here are example bivariate relationships learned by the simulator. Do you see anything off? .center[ <img src="figures/splsda-correlations.png" width=1200/> ] --- ### Power Analysis These are the results of our simulation experiment across varying sample sizes and proportions of truly associated taxa. When few taxa are truly predictive, many more samples are needed. .center[ <img src="figures/multivariate_power_isolated.png" width=700/> ] --- ### Copula Models These are a type of model that "couple" a collection of known marginal distributions . .center[ <img src="figures/gene-gene_dependence.png" width=900/> ] --- ### Starting Point If we were asked to simulate a vector of five correlated variables on our computers right now, what would be the easiest thing to do? --- ### Starting Point If we were asked to simulate a vector of five correlated variables on our computers right now, what would be the easiest thing to do? ``` r library(mvtnorm) D <- 5 ones <- rep(1, D) Sigma <- 0.01 * diag(D) + 0.99 * ones %*% t(ones) rmvnorm(3, rep(0, D), Sigma) ``` ``` ## [,1] [,2] [,3] [,4] [,5] ## [1,] -0.9800037 -0.9979774 -0.9419255 -0.8202038 -0.8749182 ## [2,] 1.4060867 1.4352558 1.4819112 1.4008727 1.3636058 ## [3,] 0.7326696 0.6326349 0.6412391 0.6804125 0.6181781 ``` The difficulty is that we usually want non-Gaussian margins `\(F_{1}, \dots, F_{D}\)`. --- ### Intuition * In the Gaussianized space, it's easy to model correlation. * The mapping back and forth is possible because we know the margins `\(F\)`. - `\(\Phi\)` represents the Gaussian CDF applied componentwise <br/> <br/> .center[ <img src="figures/copula_transformation.png" width=700/> ] --- ### Variations 1. We might expect the corelation structure to vary across groups. This can be accomplished by setting separate `\(\Sigma_{k}\)` across groups `\(k\)`. 1. In high-dimensions, the sample covariance `\(\hat{\Sigma}\)` can destabilize. In this case, we should use high-dimensional covariance estimators [23; 24]. .center[ <img src="figures/copula_groups.png" width=700/> ] --- ## Interlude: Evaluating Simulators --- ### Evaluation Taxonomy To be useful, simulated data need to be realistic. A few differences to be aware of: * **Narrow/Broad Measures**: Narrow measures focus on small subsets of taxa, while broad measures evaluate community-level properties. * **Graphical/Quantative**: Some checks are more easily quantifiable. * **Fit-for-purpose measures**: Evaluation can focus on specific parameter estimates or analysis results. Different types of realism should have higher priority depending on the downstream tasks. --- ### Evaluation through Classification What type of model would you use to simulate data like this? <center> <img src="figures/true_mixture.png" width="550"/> </center> --- * A natural enough starting point is a Gaussian mixture model with `\(K = 4\)`. * We can simulate from the fit, but it seems quite far off. .pull-left[ _Simulated_ <img src="figures/Gaussian (Shared Covarince).png" width="480"/> ] .pull-right[ _Truth_ <img src="figures/true_mixture.png" width="480"/> ] --- We make our assessment quantitative using the discriminator idea of [25]. The prediction probabilies below come from a gradient boosting model. Its out-of-sample accuracy is 65.5%. .pull-left[ _Simulated_ <img src="figures/Gaussian (Shared Covarince)-prob.png" width="480"/> ] .pull-right[ _Truth_ <img src="figures/true-Gaussian (Shared Covariance)-prob.png" width="480"/> ] --- As a next step, we increase number of components to `\(K = 5\)` and fit different variances per component. We still over-sample the gap between the two bottom-left clusters, but the GBM accuracy has dropped to 55.5%. .pull-left[ _Simulated_<br/> <img src="figures/Gaussian (Individual Covariance)-prob.png" width="440"/> ] .pull-right[ _Truth_<br/> <img src="figures/true-Gaussian (Individual Covariance)-prob.png" width="440"/> ] --- * We use a mixture of `\(t\)` distributions next. * GBM accuracy is now 50.6% - Unsurprisingly, this is the true mechanism that generated the data. .pull-left[ _Simulated_ <img src="figures/Student's t (Individual Covariance)-prob.png" width="500"/> ] .pull-right[ _Truth_ <img src="figures/true-Student's t (Individual Covariance)-prob.png" width="500"/> ] --- The discrimination probabilities become closer to 0.5 the more accurate the simulation becomes. <center> <img src="figures/modeling_iteration.png" width="800"/> </center> --- ## Act II: Reliability and Mediation --- ### Reliability Checks 1. Beyond power and benchmarking analysis, simulations can clarify how to interpret a complicated workflow. 1. Following the lead of [26; 27], we have been calling this a *reliability check*. These checks construct hypothetical scenarios to understand how methods behave. <div style="margin-left: 100px;"> <span style="font-family: 'Exo 2'; font-size: 18;"> The analysis should not...<br/> introduce spurious signals.<br/> give high confidence results on uncertain data.<br/> yield very different answers on similar datasets.<br/> drown out subtle effects.<br/> etc... </span> </div> --- ### Vertical Data Integration To illustrate, let's consider a vertical data integration question [28]. These are problems where we get complementary 'omics views of the same samples. .center[ <img src="figures/vertical_integration.png" width="1000"/> ] The goal is to prepare a unified analysis which considers relationships across sources. --- ### ICU Example .pull-left[ The study [29] used amplicon sequencing data to profile the bacterial, viral, and fungal composition in the gut microbiome samples from ICU patients at a hospital, including a subset who were experiencing sepsis. ] .pull-right[ <img src="figures/transkingdom_summary.png" width="500"/> ] --- ### Multiblock SPLS-DA Analysis Multiblock SPLS-DA generalizes SPLS-DA to incorporate measurements across multiple tables [30]. With `\(\texttt{sepsis} \times \texttt{antibiotics}\)` status as the response variable, the method outputs the plots below. .center[ <img src="figures/multiblock_original.svg" width="1000"/> ] --- ### Reliability Check It's not obvious how we should interpret this output. For example, the virus data must influence the bacteria plot, because the method integrated across sources, but how strong is the influence? .center[ <img src="figures/multiblock_original.svg" width="1000"/> ] Some integration methods are more vs. less aggressive than others. --- ### Semisynthetic Data To calibrate our interpretation, we first fit a simulator using all data. We then deliberately remove all associations between the bacteria community profiles and sepsis status. .center[ <img src="figures/semisynthetic_transkingdom.png" width=820/> ] --- ### Simulation Results Applying Multiblock SPLS-DA to these data suggests that we are in an "aggressive integration" regime. .center[<img src="figures/multiblock_calibration.png" width=780/>] A reliability check like this might have helped [2] realize that their normalization procedure introduced spurious associations. --- ### Model Comparison It's common to compare models using `\(R^{2}\)` or prediction performance. Less well known is that we can _also_ use semisynthetic data. This works even when regression language is insufficient. .center[ <img src="figures/model_comparison.gif" width=600/> ] --- ### Mindfulness Interventions This type of model comparison was helpful in an ongoing collaboration with Jo Handelsman (Plant Pathology) and Richie Davidson (Psychology and Psychiatry). The driving question is: <br/> <br/> <div style="font-size: 32px; font-family: 'Nunito'; margin-left: 100px;"> Is it possible to improve psychiatric treatment for a patient using knowledge of their microbiome? </div> <br/> <br/> Indeed, there is growing evidence for a relationship between the microbiome and psychiatric conditions, both in mouse models and in observational human studies [31; 32; 33; 34]. --- ### Aside: Event in Two Weeks .center[ <img src="figures/crossroads.png" width=800/> ] --- ### Pilot Study .pull-left[ 1. We re-analyzed a pilot study from 2021 [35], which gathered data from 54 participants randomly assigned to either a mindfulness training intervention or a waitlist control (n = 27 each). 1. The training lasted 2 months. Data were collected at the start, finish, and 2 month follow-up. ] .pull-right[ <img src="figures/design.png" width=450/> ] --- ### Mediation Analysis 1. We were concerned that the mindfulness intervention might be affect behavior, which in turn influences microbiota composition. 2. To explore this, we applied a form of mediation analysis to the 16S microbiome and survey data . .center[ <img src="figures/mediation-dag-terms.png" width=550/> ] --- ### Mediation Analysis 1. We were concerned that the mindfulness intervention might be affect behavior, which in turn influences microbiota composition. 2. To explore this, we applied a form of mediation analysis to the 16S microbiome and survey data . .center[ <img src="figures/mediation-dag.svg" width=450/> ] --- ### Estimated Indirect Effects These figures summarize the paths `\(T \to M \to Y\)`.</br> (i.e., color `\(\to\)` x-axis `\(\to\)` y-axis). .center[ <img src="figures/mindfulness-indirect-effects.png" width=900/> ] --- ### Synthetic Null Data We can alter the simulator so that some pathways are "turned off." Estimates derived from these data provide a reference null distribution .center[ <img src="figures/mindfulness-altered.png" width=780/><br/> <span style="font-size: 24px;"> The middle panel comes from a synthetic null: `\(T \nrightarrow M \to Y\)`. </span> ] --- ### Synthetic Null Hypothesis Testing We can rank the effects learned from both the real and synthetic null reference data. The significance threshold is chosen to control the proportion of null estimates (false positives) that are included among the discoveries. .center[ <img src="figures/synthetic_null_testing.png" width=900/> ] --- ## Conclusion --- ### Software and Resources All the examples I discussed today can be run from online tutorials we've written to accompany our papers: * Simulation for Microbiome Analysis ([go.wisc.edu/wnj5p9](https://go.wisc.edu/wnj5p9)) * Generative Models Examples ([go.wisc.edu/ax73qb](https://go.wisc.edu/ax73qb)) The relevant R packages behind these analyses are: * `multimedia` - Mediation analysis for microbiome data. * `scDesign3` - An existing simulator for single cell data. * `scDesigner` - Under-development version used in the first tutorial. --- Simulation can turn a problem of logic into one of observation. .center[ <img src="figures/simulation_summary.png" width=800/> ] P.S. We need your help! We are looking for more examples to include as simulation workflows in `scDesigner`. If you have data or problems that could benefit from simulation, please reach out. --- .center[ ### Thank you! ] * Contact: ksankaran@wisc.edu * Lab Members: Margaret Thairu, Shuchen Yan, Yuliang Peng, Helena Huang * Funding: NIGMS R01GM152744, NIAID R01AI184095 * Co-authors: Hanying Jiang, Xinran Miao, Mara Beebe, Dan W. Grupe, Richie Davidson, Jo Handelsman, Saritha Kodikara, Jingyi Jessica Li, Kim-Anh Lê Cao, Susan Holmes --- class: reference ### References [1] K. A. Earle et al. "Quantitative Imaging of Gut Microbiota Spatial Organization". In: _Cell Host & Microbe_ 18.4 (Oct. 2015), p. 478–488. ISSN: 1931-3128. DOI: [10.1016/j.chom.2015.09.002](https://doi.org/10.1016%2Fj.chom.2015.09.002). URL: [http://dx.doi.org/10.1016/j.chom.2015.09.002](http://dx.doi.org/10.1016/j.chom.2015.09.002). [2] G. D. Poore et al. "RETRACTED ARTICLE: Microbiome analyses of blood and tissues suggest cancer diagnostic approach". In: _Nature_ 579.7800 (Mar. 2020), p. 567–574. ISSN: 1476-4687. DOI: [10.1038/s41586-020-2095-1](https://doi.org/10.1038%2Fs41586-020-2095-1). URL: [http://dx.doi.org/10.1038/s41586-020-2095-1](http://dx.doi.org/10.1038/s41586-020-2095-1). [3] A. Gihawi et al. "Major data analysis errors invalidate cancer microbiome findings". In: _mBio_ 14.5 (Oct. 2023). Ed. by I. B. Zhulin. ISSN: 2150-7511. DOI: [10.1128/mbio.01607-23](https://doi.org/10.1128%2Fmbio.01607-23). URL: [http://dx.doi.org/10.1128/mbio.01607-23](http://dx.doi.org/10.1128/mbio.01607-23). [4] G. D. Sepich-Poore et al. "Reply to: Caution Regarding the Specificities of Pan-Cancer Microbial Structure". (Feb. 2023). DOI: [10.1101/2023.02.10.528049](https://doi.org/10.1101%2F2023.02.10.528049). URL: [http://dx.doi.org/10.1101/2023.02.10.528049](http://dx.doi.org/10.1101/2023.02.10.528049). [5] G. Tonkin-Hill. _GitHub - gtonkinhill/TCGA\_analysis - github.com_. <https://github.com/gtonkinhill/TCGA_analysis>. [Accessed 21-06-2024]. 2023. [6] K. Sankaran. " Discovery and visualization of latent structure with applications to the microbiome". PhD thesis. Stanford University, 2018. URL: [https://purl.stanford.edu/nx110xz3452](https://purl.stanford.edu/nx110xz3452). [7] E. Pasolli et al. "Accessible, curated metagenomic data through ExperimentHub". In: _Nature Methods_ 14 (2017), pp. 1023-1024. URL: [https://api.semanticscholar.org/CorpusID:3403081](https://api.semanticscholar.org/CorpusID:3403081). [8] E. Muller et al. "The gut microbiome-metabolome dataset collection: a curated resource for integrative meta-analysis". In: _npj Biofilms and Microbiomes_ 8.1 (Oct. 2022). ISSN: 2055-5008. DOI: [10.1038/s41522-022-00345-5](https://doi.org/10.1038%2Fs41522-022-00345-5). URL: [http://dx.doi.org/10.1038/s41522-022-00345-5](http://dx.doi.org/10.1038/s41522-022-00345-5). [9] Felix G.M. Ernst <felix.gm.ernst@outlook.com> [aut, cre] (<https://orcid.org/0000-0001-5064-0928>), Leo Lahti [aut] (<https://orcid.org/0000-0001-5537-637X>), Sudarshan Shetty <sudarshanshetty9@gmail.com> [aut] (<https://orcid.org/0000-0001-7280-9915>). _microbiomeDataSets_. 2021. DOI: [10.18129/B9.BIOC.MICROBIOMEDATASETS](https://doi.org/10.18129%2FB9.BIOC.MICROBIOMEDATASETS). URL: [https://bioconductor.org/packages/microbiomeDataSets](https://bioconductor.org/packages/microbiomeDataSets). [10] _Home - National Microbiome Data Collaborative - microbiomedata.org_. <https://microbiomedata.org/>. [Accessed 17-02-2025]. [11] K. Sankaran et al. "Semisynthetic simulation for microbiome data analysis". En. In: _Brief. Bioinform._ 26.1 (Nov. 2024). [12] K. Sankaran et al. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), p. 325–352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134). URL: [http://dx.doi.org/10.1146/annurev-statistics-033121-110134](http://dx.doi.org/10.1146/annurev-statistics-033121-110134). [13] H. Jiang et al. "Multimedia: multimodal mediation analysis of microbiome data". In: _Microbiology Spectrum_ 13.2 (Feb. 2025). Ed. by J. Claesen. ISSN: 2165-0497. DOI: [10.1128/spectrum.01131-24](https://doi.org/10.1128%2Fspectrum.01131-24). URL: [http://dx.doi.org/10.1128/spectrum.01131-24](http://dx.doi.org/10.1128/spectrum.01131-24). --- class: reference ### References [14] Y. Benjamini et al. "Controlling the false discovery rate: a practical and powerful approach to multiple testing". In: _Journal of the Royal statistical society: series B (Methodological)_ 57.1 (1995), pp. 289-300. [15] B. Efron. _Large-scale inference: empirical Bayes methods for estimation, testing, and prediction_. Vol. 1. Cambridge University Press, 2012. [16] L. Lahti et al. "Tipping elements in the human intestinal ecosystem". In: _Nature Communications_ 5.1 (Jul. 2014). ISSN: 2041-1723. DOI: [10.1038/ncomms5344](https://doi.org/10.1038%2Fncomms5344). URL: [http://dx.doi.org/10.1038/ncomms5344](http://dx.doi.org/10.1038/ncomms5344). [17] D. Song et al. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". En. In: _Nature Biotechnology_ 42.2 (May. 2023), p. 247–252. DOI: [10.1038/s41587-023-01772-1](https://doi.org/10.1038%2Fs41587-023-01772-1). URL: [http://dx.doi.org/10.1038/s41587-023-01772-1](http://dx.doi.org/10.1038/s41587-023-01772-1). [18] Y. Shen et al. "Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO". In: _Bayesian Anal._ -1.-1 (Jan. 2024). [19] Y. Shen et al. "The effect of the prior and the experimental design on the inference of the precision matrix in Gaussian chain graph models". En. In: _J. Agric. Biol. Environ. Stat._ (May. 2024). [20] K. Sankaran et al. "Latent variable modeling for the microbiome". In: _Biostatistics_ 20.4 (Jun. 2018), p. 599–614. ISSN: 1468-4357. DOI: [10.1093/biostatistics/kxy018](https://doi.org/10.1093%2Fbiostatistics%2Fkxy018). URL: [http://dx.doi.org/10.1093/biostatistics/kxy018](http://dx.doi.org/10.1093/biostatistics/kxy018). [21] R. Lenth. _Java applets for power and sample size - homepage.divms.uiowa.edu_. <https://homepage.divms.uiowa.edu/~rlenth/Power/index.html>. [Accessed 17-02-2025]. [22] F. Rohart et al. "mixOmics: An R package for 'omics feature selection and multiple data integration". En. In: _PLoS Comput. Biol._ 13.11 (Nov. 2017), p. e1005752. [23] J. Friedman et al. "Sparse inverse covariance estimation with the graphical lasso". En. In: _Biostatistics_ 9.3 (Jul. 2008), pp. 432-441. [24] T. Cai et al. "Adaptive Thresholding for Sparse Covariance Matrix Estimation". In: _J. Am. Stat. Assoc._ 106.494 (Jun. 2011), pp. 672-684. [25] J. Friedman. _On multivariate goodness-of-fit and two-sample testing_. Tech. rep. Citeseer, 2004. [26] D. Song et al. "PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data". In: _Genome biology_ 22.1 (2021), p. 124. [27] D. Song. "Improving Statistical Rigor in Single-Cell and Spatial Omics". PhD thesis. University of California, Los Angeles, 2024. [28] K. Lê Cao et al. "Community-wide hackathons to identify central themes in single-cell multi-omics". In: _Genome biology_ 22 (2021), pp. 1-21. --- class: reference ### References [29] B. W. Haak et al. "Integrative transkingdom analysis of the gut microbiome in antibiotic perturbation and critical illness". En. In: _mSystems_ 6.2 (Mar. 2021). [30] A. Singh et al. "DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays". In: _Bioinformatics_ 35.17 (Jan. 2019). Ed. by I. Birol, p. 3055–3062. ISSN: 1367-4811. DOI: [10.1093/bioinformatics/bty1054](https://doi.org/10.1093%2Fbioinformatics%2Fbty1054). URL: [http://dx.doi.org/10.1093/bioinformatics/bty1054](http://dx.doi.org/10.1093/bioinformatics/bty1054). [31] L. H. Morais et al. "The gut microbiota-brain axis in behaviour and brain disorders". In: _Nature Reviews Microbiology_ 19.4 (2021), pp. 241-255. [32] J. A. Bosch et al. "The gut microbiota and depressive symptoms across ethnic groups". En. In: _Nature Communications_ 13.1 (Dec. 2022), p. 7129. ISSN: 2041-1723. DOI: [10.1038/s41467-022-34504-1](https://doi.org/10.1038%2Fs41467-022-34504-1). URL: [https://www.nature.com/articles/s41467-022-34504-1](https://www.nature.com/articles/s41467-022-34504-1) (visited on 12/11/2022). [33] J. A. Foster et al. "Gut-brain axis: how the microbiome influences anxiety and depression". In: _Trends in Neurosciences_ 36.5 (May. 2013), pp. 305-312. [34] P. Zheng et al. "The gut microbiome modulates gut-brain axis glycerophospholipid metabolism in a region-specific manner in a nonhuman primate model of depression". In: _Molecular psychiatry_ 26.6 (2021), pp. 2380-2392. [35] D. W. Grupe et al. "The Impact of Mindfulness Training on Police Officer Stress, Mental Health, and Salivary Cortisol Levels". In: _Frontiers in Psychology_ 12 (Sep. 2021). ISSN: 1664-1078. DOI: [10.3389/fpsyg.2021.720753](https://doi.org/10.3389%2Ffpsyg.2021.720753). URL: [http://dx.doi.org/10.3389/fpsyg.2021.720753](http://dx.doi.org/10.3389/fpsyg.2021.720753). --- ### Evaluation Taxonomy This is how some common techniques fall into this taxonomy. * **Graphical, Narrow**: Boxplots or cumulative distribution function plots comparing real vs. simulated taxa (like the DA example). * **Graphical, Broad**: Principal component plots of real vs. simulated dataset. * **Quantitative, Narrow**: Two-sample Kolmogorov-Smirnov test. * **Quantitative, Broad**: Evaluation through classification (next example). * **Fit-for-Purpose**: Linear model coefficients on real vs. simulated data ([11], "Batch Effect Correction"). --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step1.png" width=500/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step2.png" width=240/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step3.png" width=400/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar. .center[ <img src="figures/PLS-step4.png" width=150/> ] Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`. --- ### SPLS-DA Intuition Now we can compare samples from the two tables in a single, shared space. .center[ <img src="figures/PLS-step5.png" width=800/> ] --- ### SPLS-DA Intuition Now we can compare samples from the two tables in a single, shared space. .center[ <img src="figures/PLS-step6.png" width=800/> ] --- ### SPLS-DA Intuition To get more than one dimension, we can repeat this process after removing any correlation with previously found patterns. .center[ <img src="figures/PLS-step7.png" width=800/> ] --- ### Copula Models More formally, let `\(F_{1}, \dots, F_{D}\)` be the target margins and let `\(\Phi\)` be the CDF of the Gaussian distribution. Gaussian Copula modeling has these steps. Estimate: 1. Gaussianize the observed `\(\mathbf{x}_{i}\)` to `\(\mathbf{z}_{i} := \left[\Phi^{-1}\left(F_{1}\left(x_{i1}\right)\right), \dots, \Phi^{-1}\left(F_{D}\left(x_{iD}\right)\right)\right]\)` 1. Estimate the covariance `\(\hat{\Sigma}\)` associated with `\(z_{i}\)` Simulate: 1. Draw `\(\mathbf{z}^\ast \sim \mathcal{N}\left(0, \Sigma\right)\)` 1. Transform back `\(\mathbf{x}^{\ast} := \left[F_{1}^{-1}\left(\Phi\left(z_{i1}^\ast\right)\right), \dots, F_{D}^{-1}\left(\Phi\left(z_{iD}^\ast\right)\right)\right]\)` --- <img src="figures/correlation_histogram.png" width=700/> --- <img src="figures/covariance_hyperparameter_errors.png" width=700/> --- ### Acknowledgments frustration by Rikas Dzihab from <a href="https://thenounproject.com/browse/icons/term/frustration/" target="_blank" title="frustration Icons">Noun Project</a> (CC BY 3.0) confused by Rikas Dzihab from <a href="https://thenounproject.com/browse/icons/term/confused/" target="_blank" title="confused Icons">Noun Project</a> (CC BY 3.0)