class: title background-size: cover <div id="links"> Melbourne Integrative Genomics Seminar <br/> Slides: <a href="https://go.wisc.edu/ux3oc1">go.wisc.edu/ux3oc1</a> </div> <div id="title"> Expressive Interfaces for Multi-omics Simulation </div> <br/> <i>As the program notes say, "Dive into the gene pool to see evolution in action," and enjoy "genetic engineering in the privacy of your own home." Requires: 2.5 megabytes of system memory or 3 megabytes under System 7, and a hard drive.</i> -- The New York Times, 1992, reviewing the newly released SimLife <br/> <i>Heard melodies are sweet, but those unheard <br/> Are sweeter; therefore, ye soft pipes, play on... <br/></i> -- Keats <div id="subtitle"> Kris Sankaran <br/> <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/> 31 | May | 2024 <br/> </div> --- ### Why Simulate? * **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes. * **Benchmarking**: They allow us to benchmark algorithms even when ground truth labels are unavailable. * **Data Augmentation**: Simulated samples can improve algorithmic performance and integration. * **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate. --- ### Why Simulate? * **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes. * **Benchmarking**: They allow us to benchmark algorithms even when ground truth labels are unavailable. * **Data Augmentation**: Simulated samples can improve algorithmic performance and integration. * **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate. <g style="font-size: 20px; margin: 0; line-height: 30px; display: block;"> Examples: BEERS2, compcodeR, deconvR, dropsim, ESCO, FreeHi-C, FreeHiCLite, hierarchicell, kersplat, metaART, MIDAS, Mimesys, MOSim, MSstatsSampleSize, multiomics_networks_simulation, muscat, polyester, powsimR, POWSC, SCDD, scDesign3, SCRIP, Sim3C, SimATAC, SimFFPE, sismonr, spaSim, sparseDOSSA, SparseDC, SPARSim, SPsimSeq, Splat, SymSim, ZINB-WaVE, zingeR </g> --- ### A Grand Challenge *Many single-cell data analysis packages include their own ad hoc data simulators. However, these simulators are usually not available as separate tools or even as a source code, tailored to specific problems studied in corresponding papers and sometimes not comprehensively documented, thus limiting their utility for the broad research community.* -- Challenge 11 from "Eleven grand challenges in single-cell data science" (Lahnemann et al., 2020). --- ### That's a Bit Uncharitable 1. It's true that most researchers do not routinely use simulation. The issue isn't sloppiness in sharing or documenting code, though. 1. The deeper issues are: * Interfaces often suffer from large interaction gulfs (Hutchins et al., 1985). It is hard to specify simulators and iterate. .center[ <img src="figures/interaction_gulfs.png" width=600/> ] --- ### That's a Bit Uncharitable 1. It's true that most researchers do not routinely use simulation. The issue isn't sloppiness in sharing or documenting code, though. 1. The deeper issues are: * We usually view simulators as monolithic. We can adjust parameters but can't interchange components. This makes them hard to adapt to new technologies or applications. .center[ <img src="figures/grammar_of_graphics.png" width=650/> <span style="font-size: 16px"> <br/> Modular decomposition of a plot according to the grammar of graphics (Wilkinson, 2005). </span> ] --- ### Existing Interfaces This is from the [`splatter` introductory vignette](https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html#2_Quickstart) (Zappia et al., 2017). Of the packages I've listed, it has most thoughtful interface. ``` r params <- setParams(params, mean.shape = 0.5, de.prob = 0.2) params #> A Params object of class SplatParams #> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT' #> Secondary parameters are usually set during simulation #> #> Global: #> (GENES) (Cells) [SEED] #> 8000 100 694289 #> #> 29 additional parameters #> #> Batches: #> [Batches] [Batch Cells] [Location] [Scale] [Remove] #> 1 100 0.1 0.1 FALSE #> #> Mean: #> (RATE) (SHAPE) #> 0.5 0.5 ``` --- ### Existing Interfaces .pull-left[ Here is an example from [`scDesign3`](https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3.html) (Song et al., 2023). This package has the most generally applicable statistical methodology. ] .pull-right[ ``` r example_simu <- scdesign3( sce = example_sce, assay_use = "counts", celltype = "cell_type", pseudotime = "pseudotime", spatial = NULL, other_covariates = NULL, mu_formula = "s(pseudotime, k = 10, bs = 'cr')", sigma_formula = "1", # If you want your dispersion also varies along pseudotime, use "s(pseudotime, k = 5, bs = 'cr')" family_use = "nb", n_cores = 2, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE ) ``` ] --- <g style="font-size: 36px"> <b>Main Idea: Apply interactive computing principles to multi-omics simulation.</b> </g> <img src="figures/Lego_Brick_4.svg" width=50/> <span style="color:#025E73">Modularity</span>: We should be able to build problem-specific simulators by composing simple pieces. <img src="figures/computer_mouse.png" width=35/> <span style="color:#D94E4E">Interactivity</span>: We should give domain researchers agency in designing, evaluating, and modifying statistical hypotheses. <img src="figures/speech_bubbles.png" width=70/> <span style="color:#378C5C">Cooperation</span>: Science is a social activity, and good tools allow researchers to build upon one another's work. --- ### `scDesign3` Review The rest of the talk will be about using these principles to develop a new version of `scDesign3`. Let's review that package's approach. .center[ <img src="figures/scdesignOverview.png" width=800/> ] First, we estimate models `\(\hat{F}_{g}\left(y_{i} \vert \mathbf{x}_{i}\right)\)` for each gene `\(g\)`. * Can use a variety of families: Gaussian, Poisson, Negative Binomial,... * Can learn relationships for each parameter `\(\theta\left(\mathbf{x}_{i}\right)\)`. --- ### `scDesign3` Review .pull-left[ <img src="figures/scdesignOverview2.png" width=400/> ] .pull-right[ 1. Next, we model the joint distribution of quantiles using a copula model. 1. This correlates genes even after conditioning on the same `\(\mathbf{x}_{i}\)`. ] --- .section[ ## Interface Design ] --- ### Nouns & Verbs We can break the interface design question into two parts. 1. **Data Structures**: How can we represent the simulator in a way that is both transparent to a human and precise enough for a computer? 1. **Operations**: How can we encourage users to study and tinker with the data structures? If the resulting grammar is expressive enough, then researchers will be able to solve problems we may not have anticipated. --- ### Nouns: Gene-Level Models For each gene, we can specify the regression formula and distributional family. ``` r margins <- setup_margins(sce, ~ ns(pseudotime, 3), ~ ZINBLSS()) margins ``` <pre class="r-output"><code>## <span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>Pyy</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>Iapp</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>3</span> <span style='color: #00BBBB;'>Chgb</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>Rbp4</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>5</span> <span style='color: #00BBBB;'>Spp1</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>Chga</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #00BBBB;'>Pyy, Iapp, Chgb</span>, and <span style='color: #00BBBB;'>7</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 0 × 0</span> </code></pre> --- ### Nouns: Copula Models We can tie together a collection of marginals using a copula model. .center[ <img src="figures/gene-gene_dependence.png" width=900/> ] --- ### Extensibility For the marginals, we can borrow from existing LSS packages (Rigby et al., 2005; Hofner et al., 2014) to implement gene-wise regressions, * Gaussian, Gamma, Poisson, Binomial, Negative-Binomial, Zero-Inflated Poisson, Zero Inflated Negative Binomial,… We can also build a unified interface to various copula estimation routines, * Vine copulas for capturing higher-order moments. * Gaussian and Student-t copulas with standard sample covariance, shrunken covariance, or graphical lasso precision estimates. --- ### Verbs: <span style="color: #D94E4E">Plot</span> <img src="20240531_files/figure-html/example_plot-1.png" width="800" style="display: block; margin: auto;" /> --- ### Verbs: <span style="color:#025E73">Mutate</span> `mutate` lets you modify a few elements from a larger simulator. ``` r altered <- mutate(margins, Chga:Ins1, link = ~ pseudotime) |> estimate(sce) ``` <img src="20240531_files/figure-html/plot_alteration-1.png" width="800" style="display: block; margin: auto;" /> --- ### Verbs: <span style="color:#025E73">Mutate</span> 1. Here is a more realistic example from a longitudinal microbiome study. 2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes. .pull-three-quarters-left[ <img src="figures/nulls_unaltered.png"/> ] .pull-three-quarters-right[ <img src="figures/pairwise_cors.png"/> ] --- ### Verbs: <span style="color:#025E73">Mutate</span> 1. Here is a more realistic example from a longitudinal microbiome study. 2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes. .pull-three-quarters-left[ <img src="figures/altered_ns.png"/> ] .pull-three-quarters-right[ <img src="figures/pairwise_cors_altered.png"/> ] --- ### Verbs: <span style="color:#025E73">Join</span> We should make it possible to combine simulators like Lego blocks. ``` r experiments <- list(methylation = SCGEMMETH_sce, rna = SCGEMRNA_sce) families <- list(~ BI(), ~ GaussianLSS()) sims <- experiments |> map2(families, \(x, y) setup_simulator(x, ~ cell_type, y)) ``` <img src="figures/simulator_join_motivation.png"/> --- ### Verbs: <span style="color:#025E73">Join</span> (Copula) One approach is to merge the list of marginal distributions and re-estimate the joint distribution. ``` r sim_joined <- map(sims, estimate, nu = 0.1) |> join_copula(copula_glasso()) ``` This assumes that we have samples where all features are measured. .center[ <img src="figures/copula_join.png" width = 680/> ] --- ### Verbs: <span style="color:#025E73">Join</span> (Conditioning) Alternatively, we can combine two simulators by conditioning them on shared latent structure. ``` r sim_joined <- join_pamona(sims) ``` <pre class="r-output"><code>## Pamona start! ## use random seed: 1 ## Pamona Done! takes 1.464344 seconds </code></pre> .center[ <img src="figures/simulator_join.png" width = 800/> ] --- ### Verbs: <span style="color:#025E73">Join</span> (Conditioning) This used partial manifold alignment (Cao et al., 2020) to learn shared latent variables across assays and works even in the diagonal integration setting. ``` r sim_joined ``` <pre class="r-output"><code>## $methylation ## <span style='color: #555555;'>[Marginals] ## </span><span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>AK123759</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>ADAM33</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>3</span> <span style='color: #00BBBB;'>NFIX</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>FOXD2</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>5</span> <span style='color: #00BBBB;'>HLX.AS1</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>LY86.AS1</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #00BBBB;'>AK123759, ADAM33, NFIX</span>, and <span style='color: #00BBBB;'>24</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 0 × 0</span> ## <span style='color: #555555;'> ## [Dependence] ## </span>0 NULLs with features ## <span style='color: #555555;'> ## [Template Data] ## </span>class: SingleCellExperiment ## dim: 27 142 ## metadata(0): ## assays(1): counts ## rownames(27): AK123759 ADAM33 ... KCNQ2 CDH22 ## rowData names(0): ## colnames(142): CellMeth1 CellMeth2 ... CellMeth141 CellMeth142 ## colData names(10): X1 X2 ... UMAP1 UMAP2 ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): ## ## $rna ## <span style='color: #555555;'>[Marginals] ## </span><span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>ZIC3</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>KCNQ2</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>3</span> <span style='color: #00BBBB;'>ZFP42</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>OTX2</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>5</span> <span style='color: #00BBBB;'>SALL4</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>NANOG</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #00BBBB;'>ZIC3, KCNQ2, ZFP42</span>, and <span style='color: #00BBBB;'>29</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 0 × 0</span> ## <span style='color: #555555;'> ## [Dependence] ## </span>0 NULLs with features ## <span style='color: #555555;'> ## [Template Data] ## </span>class: SingleCellExperiment ## dim: 32 177 ## metadata(0): ## assays(1): logcounts ## rownames(32): ZIC3 KCNQ2 ... NESTIN MYC ## rowData names(0): ## colnames(177): CellRNA1 CellRNA2 ... CellRNA176 CellRNA177 ## colData names(10): X1 X2 ... UMAP1 UMAP2 ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): </code></pre> --- .section[ ## Interactive Demo ] .center[ <span style="font-size: 48px"> [go.wisc.edu/992c78](https://go.wisc.edu/992c78) </span> <img src="figures/demo_link.png" width=380/> ] --- ### Example 2: Power Analysis 1. With a given sample size and experimental design, what types of signals will be recoverable? 1. Even when we don't have access to analytical power formulas, we can draw conclusions using computational experiments. .center[ <img src="figures/concrete_power.png" width=700/> ] --- ### Problem Context (Gavin et al., 2018) studied the metaproteomics of Type I Diabetes (T1D). .pull-left[ * They sampled participants at several disease stages (33 with T1D, 17 at high risk, 29 at low risk, 22 healthy) * Their analysis used sPLS-DA (Lê Cao et al., 2011), a variant of linear discriminant analysis. ] .pull-right[ <img src="figures/gavin_t1d_summary.png" width=500/> ] They found that intestinal and pancreas function are altered even before the onset of full T1D. --- ### Multivariate Power Analysis What would be a good power analysis for a follow-up study? 1. Estimation: Learn a simulator from the 101 available samples. 2. Perturbation: Generate data with varying sample sizes and signals. 3. Evaluation: Compute the average sPLS-DA performance across replicates. .center[ <img src="figures/power_overview.png" width=700/> ] --- ### Estimation 1. We fit Gaussian LSS models to log-ratio transformed measurements of the 427 proteins found in `\(\geq\)` 70% of subjects. 2. Means and variances are allowed to vary across low- and high-risk groups. 3. Main difficulty: The real data have long left tails. .center[ <img src="figures/t1D-multivariate-0.png", width=800/> ] --- ### Perturbation .pull-left[ * We `mutate` the simulators so that the true number of T1D-associated proteins is `\(10, 30, \dots, 90\)` * We choose the true non-nulls from among proteins with observed strong effects - 32 are found significant at `\(\alpha = 0.05\)` in the original data ] .pull-right[ <img src="figures/p_values.png"/> ] --- ### Evaluation * The AUC for tuned sPLS-DA models varies from 0.5 (random guess) to over 0.9 in the high-signal regime. * Potential Extension: Model patient heterogeneity (age, finer risk levels). .center[ <img src="figures/t1D-multivariate-2.png" width=550/> ] --- ### <span style="color:#D94E4E">Interactive Plot</span> We have an experimental interactive visualization that can be used for model criticism. .center[ <a href="http://localhost:5000"> <img src="figures/scdesigner-interactive.png" width=700/> </a> ] --- ### History (Scott et al., 1954) is the earliest example I've found that uses computer simulated data to guide data analysis. .pull-left[ <img src="figures/galaxies-real.png" width=400/> ] .pull-right[ <img src="figures/galaxies-synthetic.png" width=400/> ] --- ### Upcoming Short Course **Registration**: [go.wisc.edu/vsn1ix](https://go.wisc.edu/vsn1ix) (Free, but limited seats) When: June 11, 18, 25 from 10am - 12pm This hands-on short course will: 1. Review the computing and statistical concepts helpful for effective simulation in biological applications. 1. Provide in-depth code examples of how to apply simulation to solve power analysis and benchmarking problems. 1. An opportunity to discuss simulation problems with peers and experts. --- ### Upcoming Short Course **Registration**: [go.wisc.edu/vsn1ix](https://go.wisc.edu/vsn1ix) (Free, but limited seats) June 11 - Data Structures; Differential Testing and Simulation June 18 - Multivariate Analysis and Simulation June 25 - Visual Evaluation; Multi-omics Integration and Simulation <br/> <br/> <br/> .center[ <img src="figures/distributions.png" width=700/> ] --- ### Summary 1. Simulations support the design and comparison of complex workflows. This is especially important for honest power analysis and benchmarking. 1. They encourage critical, quantitative thinking through computational experiments. They can serve as a gateway to advanced statistics. 1. Our interface makes these simulators transparent and adaptable. --- ### Thank you! * Lab Members: Margaret Thairu, Hanying Jiang, Shuchen Yan, Mason Garza, Yuliang Peng, Kaiyan Ma, Kai Cui, and Kobe Uko * Discussions with Saritha Kodikara (UniMelb), Kim-Anh Le Cao (UniMelb), Jingyi Jessica Li (UCLA), Andres Colubri (UMass) * Funding: NIGMS R01GM152744. --- ### References [1] K. Cao et al. "Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona". In: _Bioinformatics_ 38 (2020), pp. 211 - 219. <https://api.semanticscholar.org/CorpusID:226959190>. [2] P. G. Gavin et al. "Intestinal Metaproteomics Reveals Host-Microbiota Interactions in Subjects at Risk for Type 1 Diabetes". In: _Diabetes Care_ 41.10 (Aug. 2018), p. 2178–2186. ISSN: 1935-5548. DOI: 10.2337/dc18-0777. <http://dx.doi.org/10.2337/dc18-0777>. [3] J. Heer. "Agency plus automation: Designing artificial intelligence into interactive systems". In: _Proceedings of the National Academy of Sciences_ 116.6 (Feb. 2019), p. 1844–1850. ISSN: 1091-6490. DOI: 10.1073/pnas.1807184115. <http://dx.doi.org/10.1073/pnas.1807184115>. --- ### References [1] B. Hofner et al. "gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework". In: _arXiv: Computation_ (2014). <https://api.semanticscholar.org/CorpusID:62302678>. [2] D. Lahnemann et al. "Eleven grand challenges in single-cell data science". In: _Genome Biology_ 21 (2020). <https://api.semanticscholar.org/CorpusID:211054289>. [3] K. Lê Cao et al. "Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems". In: _BMC Bioinformatics_ 12.1 (Jun. 2011). ISSN: 1471-2105. DOI: 10.1186/1471-2105-12-253. <http://dx.doi.org/10.1186/1471-2105-12-253>. --- ### References [1] R. A. Rigby et al. "Generalized additive models for location, scale and shape". In: _Journal of the Royal Statistical Society: Series C (Applied Statistics)_ 54 (2005). <https://api.semanticscholar.org/CorpusID:56076930>. [2] M. S. M. Sajjadi et al. "Assessing Generative Models via Precision and Recall". In: _ArXiv_ abs/1806.00035 (2018). <https://api.semanticscholar.org/CorpusID:44104089>. [3] E. L. Scott et al. "Comparison of the Synthetic and Actual Distribution of Galaxies on a Photographic Plate." In: _The Astrophysical Journal_ 119 (Jan. 1954), p. 91. ISSN: 1538-4357. DOI: 10.1086/145799. <http://dx.doi.org/10.1086/145799>. --- ### References [1] D. Song et al. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". In: _Nature Biotechnology_ (2023), pp. 1-6. <https://api.semanticscholar.org/CorpusID:258638565>. [2] L. Zappia et al. "Splatter: simulation of single-cell RNA sequencing data". In: _Genome Biology_ 18 (2017). <https://api.semanticscholar.org/CorpusID:9332625>. --- ### Verbs: Plot ``` r sample(simulator) |> plot_correlations(points = point_opts) ``` <img src="20240531_files/figure-html/correlation_plot-1.png" width="600" style="display: block; margin: auto;" /> --- ### Verbs: <span style="color:#D94E4E">Mutate</span> There are many reasons we might want to alter an initial simulator, * **Synthetic Nulls**: We can define negative controls to safeguard against false discoveries. * **Iteration**: We may want to improve the simulator’s goodness of fit by modifying the regression. * **Power Analysis**: We may want to construct data from alternative experimental designs or biological signal strengths. --- ### <span style="color:#025E73">Modularity</span> - Augment Another way to alter a simulator is to make implicit data explicitly available, in the spirit of the broom package's `augment`. For example, these data appear to have mouse-level effects. .center[ <img src="figures/subject_effect_real.png" width=750/> ] --- ### <span style="color:#025E73">Modularity</span> - Augment 1. Unfortunately, none of the packages we use at the marginal estimation step support random effects. 1. We defined an `augment` function that defines a simple proxy, * Compute sample-wise low-dimensional multidimensional scaling coordinates `\(z_{i}\)`. * For mouse `\(s\)`, condition on its average coordinates: `\(F_{g}\left(y_{ig} \vert x_{i}, \bar{z}_{s\left(i\right)}\right)\)` * Use with caution: The `\(z_{i}\)` will be correlated with the diet effect. .large-code[ ``` r exper <- augment_mds(exper, K = 4) |> groupwise_average(starts_with("MDS"), MouseID) ``` ] --- ### <span style="color:#025E73">Modularity</span> - Augment * Compute sample-wise low-dimensional multidimensional scaling coordinates `\(z_{i}\)`. * For mouse `\(s\)`, condition on its average coordinates: `\(F_{g}\left(y_{ig} \vert x_{i}, \bar{z}_{s\left(i\right)}\right)\)` * Use with caution: The `\(z_{i}\)` will be correlated with the diet effect. .center[ <img src="figures/mouse_effect_comparison.png" width=1000/> ] --- ### Example: Multi-omics Power Analysis 1. In practice, most power analyses are univariate, e.g., to guide differential expression. 2. How should we do multivariate power analysis, especially with several tables? --- ### Example Let’s consider the microbiome + metabolome mouse study from (Callahan et al., 2016). This had 12 samples. 1. How does estimation quality improve with sample size? 2. Is it worth gathering unpaired samples? (mosaic designs) .center[ <img src="figures/cca_f1000.gif" width=650/> ] --- ### Example: Estimate Simulators 1. Fit ZINB models to the microbiome. 2. Fit Gaussian models to the metabolome. 3. Join them using highly regularized covariance estimates. .center[ <img src="figures/cca_microbiome_hist.png"/> ] --- ### Part 1: Estimate Simulators 1. Fit ZINB models to the microbiome. 2. Fit Gaussian models to the metabolome. 3. Join them using highly regularized covariance estimates. .center[ <img src="figures/cca_metab_hist.png"/> ] --- ### Part 2: Hypothetical Experiments .pull-left[ 1. Across sample sizes `\(n\)`, we simulate new data, fit sparse CCA, and identify a basis `\(\hat{V}_{n}\)` spanning the `\(K = 5\)` left and right sparse CCA factors 1. We use `\(\hat{V}_{5000}\)` as a reference and compute canonical angles (Golub et al., 1994) with all smaller sample sizes. ] .pull-right[ <img src="figures/canonical_angles_paper.png"/> Figure from (Arikawa, 2018). ] --- ### Part 2: Hypothetical Experiments With more samples, the canonical angles shrink predictably. .center[ <img src="figures/canonical_angles.png" width=750/> ] --- ### Mosaic Case Here, we have allowed the sample sizes to differ and use KNN imputation to run CCA on an imputed table. .center[ <img src="figures/mosaic_example.png" width=750/> ] --- ### Aside: Negative Control The default sPLS-DA visualization output can separate classes even when no signals exist. This is a consequence of overfitting. .center[ <img src="figures/t1D-null-overfit.png" width=400/> ] --- ### Future Directions: Improving Generators .pull-left[ GAMLSS models are simple and interpretable. However, * They can oversmooth sharp transitions. * Fitting each feature separately is both statistically and computationally inefficient. ] .pull-right[ <img src="figures/spatial_example_smoothness.png"/> ] --- ### Future Directions: Quantitative Evaluation .pull-left[ We need formal ways for gauging the quality of a simulator. * Existing evaluation criteria for simulated omics data rely on ad-hoc summaries. * GAN precision-recall measures can offer a more systematic approach (Sajjadi et al., 2018). ] .pull-right[ <img src="figures/gan-eval2.png"/> ] --- ### Future Directions: Quantitative Evaluation .pull-left[ We need formal ways for gauging the quality of a simulator. * Existing evaluation criteria for simulated omics data rely on ad-hoc summaries. * GAN precision-recall measures can offer a more systematic approach (Sajjadi et al., 2018). ] .pull-right[ <img src="figures/gan-eval1.png"/> ] --- ### Future Directions: Autocompletion * Many parameters can influence simulation quality: Preprocessing, distributional assumptions, covariates to include, spline basis * Can we "recommend" the next step of simulator construction, based on initial user specification (Heer, 2019)? .center[ <img src="figures/wrangler.png" width=500/> ]