class: title # Towards Veridical Microbiome Analysis through Semisynthetic Data <style> .slide-background { background: url("figures/cover.png") no-repeat center center; background-size: cover; opacity: 0.5; } </style> <div id="subtitle_left"> Slides: <a href="https://go.wisc.edu/5q8xvl">go.wisc.edu/5q8xvl</a><br/> Paper: <a href="https://go.wisc.edu/p12o8w">go.wisc.edu/p12o8w</a><br/> Lab: <a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/> </div> <div id="subtitle_right"> Kris Sankaran <br/> <a href="https://www.eventbrite.com/e/veridical-data-science-for-biology-2025-tickets-1384456339179/">VDSB 2025</a><br/> 11 | July | 2025 <br/> </div> <!-- 20 minute talk, 3 min Q&A --> --- ### An Early Simulation Study The supplement to [1] considers some experimental design questions, > [We] discussed the motivation behind dividing each subject into treatment and control timepoints, rather than allocating separate study subjects as controls... To quantitatively characterize the impact of this choice, we performed this simulation experiment... .center[ <img src="figures/multidomain_title.png" width=630/> ] --- How do internal and external controls compare in longitudinal microbiome analysis? - Rows: Increasing subject-to-subject variability. - Columns: Increasing perturbation effect size. .center[ <img src="figures/multidomain_simulation.png" width=1200/> ] --- Sometimes it takes a long time to appreciate the things you understood as a beginner... .center[ <img src="figures/multidomain_title-date.png" width=800/> ] --- ### Why Simulate? Simulations underlie the skills microbiome statisticians must master [2; 3]. They help us to... <img src="figures/noun-labs-99456.png" width=40/> **Design experiments** that have high power to detect subtle signals. <br/> <img src="figures/noun-benchmark-7569457.png" width=40/> **Benchmark methods** and identify gaps in the literature. <br/> <img src="figures/noun-checkmark-7518321.png" width=40/> **Check conclusions** that might be sensitive to technical processing steps. It's worthwhile to treat simulation design with the same formality as methods development. --- ### Semisynthetic Data One of the major advances has been the design of algorithms that can leverage public data resources, like [4; 5; 6; 7]. * **Semisynthetic Data**: The output from a simulator that has been designed to mimic external, template data. * **Template Data**: Previously gathered experimental data that can be used to train a simulator [3]. .center[ <img src="figures/template_defn.png" width=665/> ] --- ### Existing Interfaces This is from the [`splatter` introductory vignette](https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html#2_Quickstart) [8]. It has a thoughtfully designed interface and has been used to critique microbiome studies [9]. ``` r params <- setParams(params, mean.shape = 0.5, de.prob = 0.2) params #> A Params object of class SplatParams #> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT' #> Secondary parameters are usually set during simulation #> #> Global: #> (GENES) (Cells) [SEED] #> 8000 100 694289 #> #> 29 additional parameters #> #> Batches: #> [Batches] [Batch Cells] [Location] [Scale] [Remove] #> 1 100 0.1 0.1 FALSE #> #> Mean: #> (RATE) (SHAPE) #> 0.5 0.5 ``` --- ### Existing Interfaces .pull-left[ Here is an example from [`scDesign3`](https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3.html) [10]. This package has the most generally applicable statistical methodology. ] .pull-right[ ``` r example_simu <- scdesign3( sce = example_sce, assay_use = "counts", celltype = "cell_type", pseudotime = "pseudotime", spatial = NULL, other_covariates = NULL, mu_formula = "s(pseudotime, k = 10, bs = 'cr')", sigma_formula = "1", # If you want your dispersion also varies along pseudotime, use "s(pseudotime, k = 5, bs = 'cr')" family_use = "nb", n_cores = 2, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE ) ``` ] --- ### Challenges Simulation interfaces often suffer from what researchers in human-computer interfaces call interaction gulfs [11]. It is hard to quickly specify a data generating process, evaluate its properties, and then iterate. .center[ <img src="figures/interaction_gulfs.png" width=800/> ] --- <g style="font-size: 36px"> <b>Main Idea: Apply interactive computing principles to multi-omics simulation.</b> </g> <img src="figures/Lego_Brick_4.svg" width=50/> <span style="color:#025E73">Modularity</span>: We should be able to build problem-specific simulators by composing simple pieces. <img src="figures/computer_mouse.png" width=35/> <span style="color:#D94E4E">Interactivity</span>: We should give domain researchers agency in designing, evaluating, and modifying statistical hypotheses. <img src="figures/speech_bubbles.png" width=70/> <span style="color:#378C5C">Cooperation</span>: Science is a social activity, and good tools allow researchers to build upon one another's work. --- ### Related: simChef The simChef package [12] simplifies simulator specification and execution through a tidy interface and parallel computation. Our study focuses on how to create DGP modules for high-dimensional omics applications. .center[ <img src="figures/simchef_title.png" width=700/> ] --- ### `scDesign3` Review The rest of the talk will be about preparing a tidy-like interface fo `scDesign3` [10]. Let's review that package's approach. .center[ <img src="figures/scdesignOverview.png" width=800/> ] First, we estimate models `\(\hat{F}_{g}\left(y_{i} \vert \mathbf{x}_{i}\right)\)` for each gene `\(g\)`. * Can use a variety of families: Gaussian, Poisson, Negative Binomial,... * Can learn relationships for each parameter `\(\theta\left(\mathbf{x}_{i}\right)\)`. --- ### `scDesign3` Review .pull-left[ <img src="figures/scdesignOverview2.png" width=400/> ] .pull-right[ 1. Next, we model the joint distribution of quantiles using a copula model. 1. This correlates genes even after conditioning on the same `\(\mathbf{x}_{i}\)`. ] --- ### Nouns & Verbs We can break the interface design question into two parts. 1. **Data Structures**: How can we represent the simulator in a way that is both transparent to a human and precise enough for a computer? 1. **Operations**: How can we encourage users to study and tinker with the data structures? If the resulting grammar is expressive enough, then researchers will be able to solve problems we may not have anticipated. --- ### Verbs: <span style="color:#025E73">Mutate</span> 1. This example is from a longitudinal microbiome study about the proteins present in a mouse model of Huntington's Disease [13]. 2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes. .pull-three-quarters-left[ <img src="figures/nulls_unaltered.png"/> ] .pull-three-quarters-right[ <img src="figures/pairwise_cors.png"/> ] --- ### Verbs: <span style="color:#025E73">Mutate</span> 1. This example is from a longitudinal microbiome study about the proteins present in a mouse model of Huntington's Disease [13]. 2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes. .pull-three-quarters-left[ <img src="figures/altered_ns.png"/> ] .pull-three-quarters-right[ <img src="figures/pairwise_cors_altered.png"/> ] --- ### Verbs: <span style="color:#025E73">Join</span> We should make it possible to combine simulators like Lego blocks. <img src="figures/simulator_join_motivation.png"/> --- ### Verbs: <span style="color:#025E73">Join</span> (Copula) One approach is to merge the list of marginal distributions and re-estimate the joint distribution. This assumes that we have samples where all features are measured. .center[ <img src="figures/copula_join.png" width = 950/> ] --- ### Verbs: <span style="color:#025E73">Join</span> (Conditioning) Alternatively, we can combine two simulators by conditioning them on shared latent structure. .center[ <img src="figures/simulator_join.png" width = 1000/> ] --- ### Example: Microbiome Network Inference 1. Benchmarking methods for microbiome network inference is challenging. We can't directly observe microbe-microbe interactions, which stands in the way of ground truth labeling. 1. We use the following choices in the design of our simulation, - **Template**: American Gut Project (261 samples, 45 most abundant taxa) - **Estimator**: Zero-Inflated Negative Binomial Copula - **Goodness-of-Fit**: Graphical Checks - **Ground Truth**: Copula correlation matrix - **Summarization**: Estimated vs. ground-truth correlation --- ### Simulation Mechanism Here are samples from the Zero-Inflated Negative Binomial Copula with the covariates `~ log(sequencing_depth) + BMI`. Each panel compares real vs. simulated data for one taxon. .center[ <img src="figures/zinb_marginals.png" width=700/> ] --- ### Estimated Correlation The resulting correlation estimate lacks substantial banding or blocks. We can also sanity check some of the highly correlated pairs. .pull-left[ <img src="figures/estimated_correlation.png" width=900/> ] .pull-right[ <img src="figures/correlated_pair.png" width=400/> ] --- ### Establishing Ground Truth To create a basis for methods comparison, we modified the ZINB copula to use a block correlation matrix with varying intra-block correlation strength. .center[ <img src="figures/ground_truth_corr.png" width=450/> ] --- ### Methods Comparison We compared SpiecEasi [14], a method designed for microbiome networks, with the Ledoit-Wolf estimator [15] on `\(\log\left(1 + x\right)\)` transformed counts. .center[ <img src="figures/comparison_difference.png" width=800/> ] --- ### Methods Comparison We compared SpiecEasi [14], a method designed for microbiome networks, with the Ledoit-Wolf estimator [15] on `\(\log\left(1 + x\right)\)` transformed counts. .center[ <img src="figures/correlation_comparison_scatter.png" width=700/> ] --- ### Generalizing In our review, we consider a wider range of underlying correlation matrices. .center[ <img src="figures/network_comparison.png" width=1200/> ] --- ### Software and Resources All the examples I discussed today can be run from online tutorials: * Simulation for Microbiome Analysis ([go.wisc.edu/wnj5p9](https://go.wisc.edu/wnj5p9)) * Generative Models Examples ([go.wisc.edu/ax73qb](https://go.wisc.edu/ax73qb)) Our workshop materials are online: * UW-Madison Plant Pathology [slides](https://krisrs1128.github.io/talks/2024/20240207/20240207.html#1), [colab](https://colab.research.google.com/drive/1IyMEQJwkslPzL9FYd5atvyGORqW9IrCI?usp=sharing) * UniMelb Integrative Genomics [notebooks](https://github.com/krisrs1128/intro-to-simulation/), slides [1](https://go.wisc.edu/54tmr9), [2](https://go.wisc.edu/rc776i), [3](https://go.wisc.edu/gfj36r). The relevant R packages behind these analyses are: * `multimedia` - Mediation analysis for microbiome data [16]. * `scDesign3` - An existing simulator for single cell data [17; 18; 10]. * `scDesigner` - Under-development version used in the first tutorial. --- ### Thank you! * Contact: ksankaran@wisc.edu * Lab Members: Margaret Thairu, Shuchen Yan, Yuliang Peng, Langtian Ma, Helena Huang * Funding: NIGMS R01GM152744, NIAID R01AI184095, Gates 072185 --- class: reference ### References [1] J. Fukuyama et al. "Multidomain analyses of a longitudinal human microbiome intestinal cleanout perturbation experiment". In: _PLOS Computational Biology_ 13.8 (Aug. 2017). Ed. by E. Borenstein, p. e1005706. ISSN: 1553-7358. DOI: [10.1371/journal.pcbi.1005706](https://doi.org/10.1371%2Fjournal.pcbi.1005706). URL: [http://dx.doi.org/10.1371/journal.pcbi.1005706](http://dx.doi.org/10.1371/journal.pcbi.1005706). [2] G. D. Poore et al. "RETRACTED ARTICLE: Microbiome analyses of blood and tissues suggest cancer diagnostic approach". In: _Nature_ 579.7800 (Mar. 2020), p. 567–574. ISSN: 1476-4687. DOI: [10.1038/s41586-020-2095-1](https://doi.org/10.1038%2Fs41586-020-2095-1). URL: [http://dx.doi.org/10.1038/s41586-020-2095-1](http://dx.doi.org/10.1038/s41586-020-2095-1). [3] A. Gihawi et al. "Major data analysis errors invalidate cancer microbiome findings". In: _mBio_ 14.5 (Oct. 2023). Ed. by I. B. Zhulin. ISSN: 2150-7511. DOI: [10.1128/mbio.01607-23](https://doi.org/10.1128%2Fmbio.01607-23). URL: [http://dx.doi.org/10.1128/mbio.01607-23](http://dx.doi.org/10.1128/mbio.01607-23). [4] G. D. Sepich-Poore et al. "Reply to: Caution Regarding the Specificities of Pan-Cancer Microbial Structure". (Feb. 2023). DOI: [10.1101/2023.02.10.528049](https://doi.org/10.1101%2F2023.02.10.528049). URL: [http://dx.doi.org/10.1101/2023.02.10.528049](http://dx.doi.org/10.1101/2023.02.10.528049). [5] G. Tonkin-Hill. _GitHub - gtonkinhill/TCGA\_analysis - github.com_. <https://github.com/gtonkinhill/TCGA_analysis>. [Accessed 21-06-2024]. 2023. [6] K. Sankaran et al. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), p. 325–352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134). URL: [http://dx.doi.org/10.1146/annurev-statistics-033121-110134](http://dx.doi.org/10.1146/annurev-statistics-033121-110134). [7] K. Sankaran et al. "Semisynthetic simulation for microbiome data analysis". En. In: _Brief. Bioinform._ 26.1 (Nov. 2024). [8] E. Pasolli et al. "Accessible, curated metagenomic data through ExperimentHub". In: _Nature Methods_ 14 (2017), pp. 1023-1024. URL: [https://api.semanticscholar.org/CorpusID:3403081](https://api.semanticscholar.org/CorpusID:3403081). [9] E. Muller et al. "The gut microbiome-metabolome dataset collection: a curated resource for integrative meta-analysis". In: _npj Biofilms and Microbiomes_ 8.1 (Oct. 2022). ISSN: 2055-5008. DOI: [10.1038/s41522-022-00345-5](https://doi.org/10.1038%2Fs41522-022-00345-5). URL: [http://dx.doi.org/10.1038/s41522-022-00345-5](http://dx.doi.org/10.1038/s41522-022-00345-5). [10] Felix G.M. Ernst <felix.gm.ernst@outlook.com> [aut, cre] (<https://orcid.org/0000-0001-5064-0928>), Leo Lahti [aut] (<https://orcid.org/0000-0001-5537-637X>), Sudarshan Shetty <sudarshanshetty9@gmail.com> [aut] (<https://orcid.org/0000-0001-7280-9915>). _microbiomeDataSets_. 2021. [11] _Home - National Microbiome Data Collaborative - microbiomedata.org_. <https://microbiomedata.org/>. [Accessed 17-02-2025]. [12] L. Zappia et al. "Splatter: simulation of single-cell RNA sequencing data". In: _Genome Biology_ 18 (2017). URL: [https://api.semanticscholar.org/CorpusID:9332625](https://api.semanticscholar.org/CorpusID:9332625). [13] D. Song et al. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". In: _Nature Biotechnology_ (2023), pp. 1-6. URL: [https://api.semanticscholar.org/CorpusID:258638565](https://api.semanticscholar.org/CorpusID:258638565). --- class: reference ### References [14] E. L. Hutchins et al. "Direct manipulation interfaces". En. In: _Hum.-Comput. Interact._ 1.4 (Dec. 1985), pp. 311-338. [15] J. Duncan et al. "simChef: High-quality data science simulations in R". In: _Journal of Open Source Software_ 9.95 (Mar. 2024), p. 6156. ISSN: 2475-9066. DOI: [10.21105/joss.06156](https://doi.org/10.21105%2Fjoss.06156). URL: [http://dx.doi.org/10.21105/joss.06156](http://dx.doi.org/10.21105/joss.06156). [16] K. Cao et al. "Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona". In: _Bioinformatics_ 38 (2020), pp. 211 - 219. URL: [https://api.semanticscholar.org/CorpusID:226959190](https://api.semanticscholar.org/CorpusID:226959190). [17] C. Dotsey. _Everybody Dance Now: Tchaikovsky's Sleeping Beauty - houstonsymphony.org_. <https://houstonsymphony.org/tchaikovsky-sleeping-beauty/>. [Accessed 01-07-2025]. [18] Z. D. Kurtz et al. "Sparse and compositionally robust inference of microbial ecological networks". En. In: _PLoS Comput. Biol._ 11.5 (May. 2015), p. e1004226. [19] O. Ledoit et al. "A well-conditioned estimator for large-dimensional covariance matrices". In: _Journal of Multivariate Analysis_ 88.2 (Feb. 2004), p. 365–411. ISSN: 0047-259X. DOI: [10.1016/s0047-259x(03)00096-4](https://doi.org/10.1016%2Fs0047-259x%2803%2900096-4). URL: [http://dx.doi.org/10.1016/S0047-259X(03)00096-4](http://dx.doi.org/10.1016/S0047-259X(03)00096-4). [20] H. Jiang et al. "Multimedia: multimodal mediation analysis of microbiome data". In: _Microbiology Spectrum_ 13.2 (Feb. 2025). Ed. by J. Claesen. ISSN: 2165-0497. DOI: [10.1128/spectrum.01131-24](https://doi.org/10.1128%2Fspectrum.01131-24). URL: [http://dx.doi.org/10.1128/spectrum.01131-24](http://dx.doi.org/10.1128/spectrum.01131-24). [21] W. V. Li et al. "A statistical simulator scDesign for rational scRNA-seq experimental design". In: _Bioinformatics_ 35.14 (Jul. 2019), p. i41–i50. ISSN: 1367-4811. DOI: [10.1093/bioinformatics/btz321](https://doi.org/10.1093%2Fbioinformatics%2Fbtz321). URL: [http://dx.doi.org/10.1093/bioinformatics/btz321](http://dx.doi.org/10.1093/bioinformatics/btz321). [22] T. Sun et al. "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured". In: _Genome Biology_ 22.1 (May. 2021). ISSN: 1474-760X. DOI: [10.1186/s13059-021-02367-2](https://doi.org/10.1186%2Fs13059-021-02367-2). URL: [http://dx.doi.org/10.1186/s13059-021-02367-2](http://dx.doi.org/10.1186/s13059-021-02367-2). --- ### Nouns: Gene-Level Models For each gene, we can specify the regression formula and distributional family. ``` ## Plan: ## # A tibble: 6 × 3 ## feature family link ## <gene_id> <distn> <link> ## 1 Pyy ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## 2 Iapp ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## 3 Chgb ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## 4 Rbp4 ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## 5 Spp1 ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## 6 Chga ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## Pyy, Iapp, Chgb, and 7 other features need fitting. ## Estimates: ## # A tibble: 0 × 0 ``` --- ### Nouns: Copula Models We can tie together a collection of marginals using a copula model. .center[ <img src="figures/gene-gene_dependence.png" width=900/> ] --- ### Data Analysis Controversy .center[ <img src="figures/retraction.png"/> ] In June 2024, _Nature_ retracted a paper [19] that claimed identify microbiome signatures of cancer. This came after one year's worth of debate [20; 21] about the data analysis. --- ### Data Analysis Controversy .center[ <img src="figures/retraction.png"/> ] The "disease signature" was an artifact resulting from the use of a batch effect correction method. Before we can understand the nuances of the story, we need to learn about batch effects and correction methods. --- ### Simulation to Resolve Controversy Gerry Tonkin-Hill has an excellent re-analysis [9] of the data from [19] which sheds light what was likely the source of the phantom signals. The first part is a simulation. .pull-left[ <img src="figures/gth_sim1.png"/> ] .pull-right[ <img src="figures/gth_sim2.png"/> ] --- ### Simulation and Supervised Normalization This setting is balanced -- each condition is equally likely across batches. In this case, SVN batch effect correction works well. .pull-left[ <img src="figures/gth_sim3.png"/> ] .pull-right[ <img src="figures/gth_sim4.png"/> ] --- ### Simulation and Supervised Normalization But what happens if there is imbalance? .pull-left[ <img src="figures/gth_sim5.png"/> ] .pull-right[ <img src="figures/gth_sim6.png"/> ] --- ### Simulation and Supervised Normalization In this case, the SVN correction introduces an artificial difference. .center[ <img src="figures/gth_sim7.png" width=500/> ] This should cause concern about the original analysis: Hospitals specialize in cancer types. Then again, this simulation is quite unrealistic. --- ### Future Work We've led workshops to get feedback on our package, and two questions repeatedly arise: 1. Is there any way to make this faster? 1. How can I know when my simulator is good enough? We've been adapting literature on scalable GLM modeling and model diagnostics [2] to answer these questions. --- ### Bigger Picture 1. Communicating hypothetical experimental outcomes effectively is one of the biggest challenges of any collaborative computational biology project. Simulation can help! 1. In statistics research, simulation is at once familiar yet underappreciated. Analogy... > That a composer of symphonies and operas would consider a ballet score his best work was remarkable for the time... ballet was something of a neglected stepchild as far as the musical world was concerned. -- From [22]