class: title

<div id="title">
Semisynthetic Simulation for Biological Data Analysis
</div>
<div id="under_title">
Session 1: Marginal Modeling
</div>
<div id="subtitle">
Kris Sankaran <br/>
11 | June | 2024 <br/>
Lab: <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/>
</div>
<div id="subtitle_right">
Melbourne Integrative Genomics<br/>
Slides: <a href="https://go.wisc.edu/gfj36r">go.wisc.edu/gfj36r</a><br/>
Code: <a href="https://go.wisc.edu/o5sn6w">go.wisc.edu/o5sn6w</a>
</div>

---

### Overall Learning Outcomes

By the end of this course, you will be able to...

1. Describe simulators: What are their building blocks and core properties?
1. Apply simulators: Know how to use simulation at different stages of a multi-omics data analysis to guide workflow design and interpretation.
1. Critique simulators: Compare and contrast real and simulated data using effective data visualizations.

---

### Today: Marginal Simulation

<span style="color:#8C1F33">Marginal</span> `\(\to\)` Multivariate `\(\to\)` Integrative

.center[
<img src="figure/integration_types_1a.png" width=600/>
]

---

### Today: Marginal Simulation

<span style="color:#8C1F33">Marginal</span> `\(\to\)` Multivariate `\(\to\)` Integrative

.center[
<img src="figure/integration_types_1b.png" width=600/>
]

---

### Today: Marginal Simulation

<span style="color:#8C1F33">Marginal</span> `\(\to\)` Multivariate `\(\to\)` Integrative

.center[
<img src="figure/integration_types_1c.png" width=600/>
]

---

### Today's Learning Outcomes

By the end of this session, you will be able to...

1. Manipulate and interpret `SummarizedExperiment` objects.
1. Design a power analysis for differential testing and accurately communicate its results.
1. Identify areas of your own research where simulation could help better allocate limited resources.

---

### Course Expectations

* Bugs are normal! Resolving them is a skill you will develop.
* Please ask a tutor to help at any point. Don't worry about interrupting.
* We will have breaks and discussions. Get to know the others in the course!

---

class: middle

.center[
## Scientific Context
]

---

### The Microbiome

Imagine a collaboration with researchers who study the human gut microbiome, the ecosystem of microorganisms that live in the gut [1]. As in ordinary ecology, they want to know:

.pull-left[
* Who is present?
* What are they doing? Which genes are active?
* How does this depend on the host or environmental context?
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="https://whatislife.stanford.edu/images/spatial.png" alt="The microbiome along the gut lining, from [2]." width="250" />
<p class="caption">The microbiome along the gut lining, from [2].</p>
</div>
]

---

### Hypothetical Proposal

They are preparing a grant proposal about how community composition is related to nutrition. They want to compare the microbiomes associated with malnutrition and health.

.center[
<img src="figure/power_curve.png" width=600px/>
]

<span style="position: absolute; bottom: 20px">
We need to make sure the budget is spent wisely while ensuring the study isn't underpowered.
</span>

---

### Problem Setup

This is a differential abundance problem [3; 4]. There are many methods out there that account for:

* Potentially extreme sparsity and non-Gaussianity.
* Uneven sequencing depth across samples.
* The large number of tests (and the potential to borrow strength).

This is great news, but it means we can't just use `\(t\)`-test power calculators.

---

### Approach: Simulation

Instead, we simulate. We will simulate experiments and see what our power and false discovery rate (FDR) look like in those hypothetical datasets.

Abstract power calculation `\(\to\)` Concrete computational experimentation.

.center[
<img src="figure/concrete_power.png" width=800/>
]

---

### Simulation with Templates

* Instead of doing this from scratch, we fit a generative model to existing experimental data ("template data") [5; 6].
* Generative models can _generate_ new, hypothetical samples, not just fit observed ones [7].

.center[
<img src="figure/discriminative_vs_generative.png" width=700/>
]

---

### Simulation with Templates

* Instead of doing this from scratch, we fit a generative model to existing experimental data ("template data") [5; 6].
* Generative models can _generate_ new, hypothetical samples, not just fit observed ones [7].

.center[
<img src="figure/generative_example.png" width=900/>
]

---

### Simulation with Templates

Abstract power calculation `\(\to\)` Concrete computational experimentation.

.center[
<img src="figure/power_overview.png" width=890/>
]

---

### Discussion

Please discuss in groups of 2 to 4:

* In your own research, what is one place where simulation might help?
* What template dataset would you use to guide the simulation?

We will debrief responses as a group.

---

class: middle

.center[
## Statistical Concepts
]

---

### Approach

The richer our vocabulary for generative models, the better chance we'll have of finding a realistic simulation mechanism. Most modern methods have three parts.

1. **A template dataset on which to base the simulator.**
1. Flexible families of probability distributions.
1. Regressions to relate samples to experimental or biological factors.

.center[
<img src="figure/summarized_experiment_focus.png" width=150/>
]

---

### Approach

The richer our vocabulary for generative models, the better chance we'll have of finding a realistic simulation mechanism. Most modern methods have three parts.

1. A template dataset on which to base the simulator.
1. **Flexible families of probability distributions.**
1. Regressions to relate samples to experimental or biological factors.

.center[
<img src="figure/distributions.png" width=700/>
]

---

### Approach

The richer our vocabulary for generative models, the better chance we'll have of finding a realistic simulation mechanism. Most modern methods have three parts.

1. A template dataset on which to base the simulator.
1. Flexible families of probability distributions.
1. **Regressions to relate samples to experimental or biological factors.**

.center[
<img src="figure/marginal_dependence.png" width=500/>
]

---

### Representing Data

.pull-left[
We can tie together sequencing output, experimental design, and biological annotation using `SummarizedExperiment` objects [8]. This will let us concisely learn:

Design + Biology `\(\to\)` Sequencing.
]

.pull-right[
<img src="figure/summarized_experiment.svg" width=880/>
]

---

### Important Functions

To work with `SummarizedExperiment` objects, we can use:

* `assay`: Returns a matrix whose rows are sequencing features (e.g., genes, taxa, ...) and whose columns are samples.
* `rowData`: Returns a data.frame that annotates each sequencing feature.
* `colData`: Returns a data.frame that annotates each sample.

``` r
library(MIGsim)
data(atlas)
head(colData(atlas))
```

```
## DataFrame with 6 rows and 11 columns
##                age      sex nationality DNA_extraction_method  project diversity bmi_group  subject      time      sample log_depth
##          <integer> <factor>    <factor>              <factor> <factor> <numeric>  <factor> <factor> <numeric> <character> <numeric>
## Sample-1        28     male          US                    NA        1      5.76     obese        1         0    Sample-1   8.93498
## Sample-2        24   female          US                    NA        1      6.06     obese        2         0    Sample-2   9.22503
## Sample-3        52     male          US                    NA        1      5.50      lean        3         0    Sample-3   8.87221
## Sample-5        25   female          US                    NA        1      5.89      lean        5         0    Sample-5   9.39266
## Sample-6        42     male          US                    NA        1      5.53      lean        6         0    Sample-6   8.97639
## Sample-8        27   female          US                    NA        1      5.38      lean        8         0    Sample-8   9.07154
```

---

### Definition

We can use `setup_simulator()` to define a new simulator. This requires:

* The `SummarizedExperiment` template data
* A regression formula relating `colData` features to the model parameters
* The probability model to use (e.g.,
Gaussian or Poisson)

``` r
library(scDesigner)
library(gamboostLSS)
# Arguments: template data, regression formula, probability model
sim <- setup_simulator(exper_ts, ~ group, ~ GaussianLSS())
```

---

### Alteration

We can modify a simulator using the `mutate` command.

``` r
sim |>
  mutate(1:3, link = ~ group + time)
```

```
## [Marginals]
## Plan:
## # A tibble: 6 × 3
##   feature   family              link
##   <gene_id> <distn>             <link>
## 1 1         Gaussian [mu,sigma] ~group + time
## 2 2         Gaussian [mu,sigma] ~group + time
## 3 3         Gaussian [mu,sigma] ~group + time
## 4 4         Gaussian [mu,sigma] ~group
## 5 5         Gaussian [mu,sigma] ~group
## 6 6         Gaussian [mu,sigma] ~group
## 1, 2, 3, and 3 other features need fitting.
## Estimates:
## # A tibble: 0 × 0
##
## [Dependence]
## 0 NULLs with features
##
## [Template Data]
## class: SummarizedExperiment
## dim: 6 500
## metadata(0):
## assays(1): counts
## rownames(6): 1 2 ... 5 6
## rowData names(0):
## colnames: NULL
## colData names(2): group time
```

A more realistic example would be to switch to a zero-inflated negative binomial for rare species:

``` r
sim |>
  mutate(any_of(rare_taxa), family = ~ ZINBLSS())
```

---

### Estimation & Sampling

Once a simulator is defined, it can be estimated with `estimate`.

``` r
sim <- sim |>
  estimate(nu = 0.1) # learning rate = 0.1
```

We can simulate new experiments using `sample`.

``` r
sample(sim)
```

```
## class: SummarizedExperiment
## dim: 6 500
## metadata(0):
## assays(1): counts_1
## rownames(6): 1 2 ... 5 6
## rowData names(0):
## colnames: NULL
## colData names(2): group time
```

---

### New Data

Alternatively, we can pass a new `colData` object to `sample`. This is useful for comparing sample sizes and experimental designs.

``` r
new_design <- expand.grid(
  group = c("A", "B"),
  time = seq(0, 1, 0.1)
)
sample(sim, new_data = new_design)
```

```
## class: SummarizedExperiment
## dim: 6 22
## metadata(0):
## assays(1): counts_1
## rownames(6): 1 2 ... 5 6
## rowData names(0):
## colnames: NULL
## colData names(2): group time
```

---

class: middle

.center[
## Demo + Exercises
]

* Code repository: [go.wisc.edu/o5sn6w](https://go.wisc.edu/o5sn6w)
* Complete Solutions: [go.wisc.edu/v986n5](https://go.wisc.edu/v986n5)
* Live Demo: [go.wisc.edu/0vr50d](https://go.wisc.edu/0vr50d)

---

### Summary

1. Guiding philosophy:
    * In experimental biology, controls can help validate data generation.
    * In computational biology, simulation can help validate statistical claims.
1. Power and FDR estimates from simulated experiments can guide a concrete, computational power analysis.
1. `SummarizedExperiment` streamlines interaction with experimental data. `assay()`, `rowData()`, and `colData()` are useful accessors.
1. To define a simulator in `scDesigner`, use `setup_simulator()` with R's formula syntax, like `~ treatment + time`.

---

### Next Time: Multivariate Simulation

Marginal `\(\to\)` <span style="color:#8C1F33">Multivariate</span> `\(\to\)` Integrative

.center[
<img src="figure/integration_types_2.png" width=600/>
]

---

### References

[1] J. A. Gilbert, M. J. Blaser, J. G. Caporaso, J. K. Jansson, S. V. Lynch, and R. Knight. "Current understanding of the human microbiome". In: _Nat. Med._ 24.4 (Apr. 2018), pp. 392-400.

[2] K. A. Earle, G. Billings, M. Sigal, J. S. Lichtman, G. C. Hansson, J. E. Elias, M. R. Amieva, K. C. Huang, and J. L. Sonnenburg. "Quantitative imaging of gut Microbiota spatial organization". In: _Cell Host Microbe_ 18.4 (Oct. 2015), pp. 478-488.

[3] H. Li and H. Li. "Introduction to special issue on statistics in microbiome and metagenomics". In: _Stat. Biosci._ 13.2 (Jul. 2021), pp. 197-199.

[4] L. Waldron. "Data and statistical methods to analyze the human microbiome". In: _mSystems_ 3.2 (Apr. 2018).

---

[5] T. Sun, D. Song, W. V. Li, and J. J. Li. "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured".
In: _Genome Biol._ 22.1 (May 2021), p. 163.

[6] D. Song, Q. Wang, G. Yan, T. Liu, T. Sun, and J. J. Li. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". In: _Nat. Biotechnol._ 42.2 (Feb. 2024), pp. 247-252.

[7] K. Sankaran and S. P. Holmes. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), pp. 325-352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134).

[8] W. Huber, V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, S. Davis, L. Gatto, T. Girke, et al. "Orchestrating high-throughput genomic analysis with Bioconductor". In: _Nat. Methods_ 12.2 (Feb. 2015), pp. 115-121.