class: title background-size: cover <div id="links"> Slides: <a href="https://go.wisc.edu/694o2e">go.wisc.edu/694o2e</a> </div> <div id="title"> Transparent Synthetic Data Generation </div> <br/> <i>As the program notes say, "Dive into the gene pool to see evolution in action," and enjoy "genetic engineering in the privacy of your own home." Requires: 2.5 megabytes of system memory or 3 megabytes under System 7, and a hard drive.</i> -- The New York Times, 1992, reviewing the newly released SimLife <br/> <i>Heard melodies are sweet, but those unheard <br/> Are sweeter; therefore, ye soft pipes, play on... <br/></i> -- Keats <div id="subtitle"> Kris Sankaran <br/> <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/> 11 | April | 2024 <br/> Machine Learning Lunch Meetings<br/> </div> --- ### Why Simulate? * **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes. * **Benchmarking**: They allow us to benchmark algorithms even when ground truth labels are unavailable. * **Data Augmentation**: Simulated samples can improve algorithmic performance and integration. * **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate. --- ### Why Simulate? * **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes. * **Benchmarking**: They allow us to benchmark algorithms even when ground truth labels are unavailable. * **Data Augmentation**: Simulated samples can improve algorithmic performance and integration. * **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate. <g style="font-size: 20px; margin: 0; line-height: 30px; display: block;"> Examples: BASICS, compcodeR, deconvR, dropsim, ESCO, FreeHi-C, FreeHiCLite, hierarchicell, kersplat, metaART, MOSim, MSstatsSampleSize, MIDAS, Mimesys, multiomics_networks_simulation, muscat, powsimR, POWSC, SCDD, scDesign3, SCRIP, Sim3C, SimATAC, SimFFPE, sismonr, spaSim, sparseDOSSA, Splat, SPARSim, SPsimSeq, SparseDC, SymSim, ZINB-WaVE, zingeR, ... </g> --- ### A Grand Challenge *Many single-cell data analysis packages include their own ad hoc data simulators [111, 211, 241, 264, 349–353]. However, these simulators are usually not available as separate tools or even as a source code, tailored to specific problems studied in corresponding papers and sometimes not comprehensively documented, thus limiting their utility for the broad research community.* -- Challenge 11 from "Eleven grand challenges in single-cell data science" (Lahnemann, Koster, Szczurek, McCarthy, Hicks, Robinson, Vallejos, Campbell, Beerenwinkel, Mahfouz, Pinello, Skums, Stamatakis, Attolini, Aparicio, Baaijens, Balvert, de Barbanson, Cappuccio, Corleone, Dutilh, Florescu, Guryev, Holmer, Jahn, Lobo, Keizer, Khatri, Kiełbasa, Korbel, Kozlov, Kuo, Lelieveldt, Măndoiu, Marioni, Marschall, Molder, Niknejad, Raczkowski, Reinders, de Ridder, Saliba, Somarakis, Stegle, Theis, Yang, Zelikovsky, Mchardy, Raphael, Shah, and Schönhuth, 2020). --- ### That's a Bit Uncharitable 1. It's true that most researchers do not routinely use simulation. The issue isn't sloppiness in sharing or documenting code, though. 1. The deeper issue is that we usually view simulators as monolithic. We can adjust parameters but can't interchange components. This makes them hard to adapt to new technologies or applications. 1. I will show two examples of well-designed simulators. How would you go about trying to extend them? --- ### Existing Interfaces This is from the [`splatter` introductory vignette](https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html#2_Quickstart) (Zappia, Phipson, and Oshlack, 2017). Of the packages I've listed, it has most thoughtful interface. ```r params <- setParams(params, mean.shape = 0.5, de.prob = 0.2) params #> A Params object of class SplatParams #> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT' #> Secondary parameters are usually set during simulation #> #> Global: #> (GENES) (Cells) [SEED] #> 8000 100 694289 #> #> 29 additional parameters #> #> Batches: #> [Batches] [Batch Cells] [Location] [Scale] [Remove] #> 1 100 0.1 0.1 FALSE #> #> Mean: #> (RATE) (SHAPE) #> 0.5 0.5 ``` --- ### Existing Interfaces .pull-left[ Here is an example from the [`scDesign3` Quick Start](https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3.html) (Song, Wang, Yan, Liu, Sun, and Li, 2023). This package has the most generally applicable statistical methodology. ] .pull-right[ ```r example_simu <- scdesign3( sce = example_sce, assay_use = "counts", celltype = "cell_type", pseudotime = "pseudotime", spatial = NULL, other_covariates = NULL, mu_formula = "s(pseudotime, k = 10, bs = 'cr')", sigma_formula = "1", # If you want your dispersion also varies along pseudotime, use "s(pseudotime, k = 5, bs = 'cr')" family_use = "nb", n_cores = 2, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE ) ``` ] --- <g style="font-size: 36px"> <b>Main Idea: Apply interactive computing principles to multi-omics simulation.</b> </g> <img src="figures/Lego_Brick_4.svg" width=50/> <span style="color:#025E73">Modularity</span>: We should be able to build problem-specific simulators by composing simple pieces. <img src="figures/computer_mouse.png" width=35/> <span style="color:#D94E4E">Interactivity</span>: We should give domain researchers agency in designing, evaluating, and modifying statistical hypotheses. <img src="figures/speech_bubbles.png" width=70/> <span style="color:#378C5C">Communication</span>: Model-building is a social activity, and well-designed tools encourage substantive community engagement. --- ### `scDesign3` Review The rest of the talk will be about using these principles to develop a new version of `scDesign3`. Let's review that package's approach. .center[ <img src="figures/scdesignOverview.png" width=800/> ] First, we estimate models `\(\hat{F}_{g}\left(y_{i} \vert \mathbf{x}_{i}\right)\)` for each gene `\(g\)`. * Can use a variety of families: Gaussian, Poisson, Negative Binomial,... * Can learn relationships for each parameter `\(\theta\left(\mathbf{x}_{i}\right)\)`. --- ### `scDesign3` Review .pull-left[ <img src="figures/scdesignOverview2.png" width=400/> ] .pull-right[ 1. Next, we model the joint distribution of quantiles using a copula model. 1. This correlates genes even after conditioning on the same `\(\mathbf{x}_{i}\)`. ] --- .section[ ## Interface Design ] --- ### Nouns & Verbs We can break the interface design question into two parts. 1. **Data Structures**: How can we represent the simulator in a way that is both transparent to a human and precise enough for a computer? 1. **Operations**: How can we encourage users to study and tinker with the data structures? If the resulting grammar is expressive enough, then researchers will be able to solve problems we may not have anticipated. --- ### Nouns: Gene-Level Models For each gene, we can specify the regression formula and distributional family. ```r margins <- setup_margins(sce, ~ ns(pseudotime, 3), ~ ZINBLSS()) margins ``` <pre class="r-output"><code>## <span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>Pyy</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>Iapp</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>3</span> <span style='color: #00BBBB;'>Chgb</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>Rbp4</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>5</span> <span style='color: #00BBBB;'>Spp1</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>Chga</span> ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #00BBBB;'>Pyy, Iapp, Chgb</span>, and <span style='color: #00BBBB;'>7</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 0 × 0</span> </code></pre> --- ### Nouns: Copula Models We can tie together a collection of marginals using a copula model. ```r simulator <- new( "JointDistribution", margins = margins, dependence = copula_gaussian() ) ``` We can support different correlation structures across groups. ```r simulator <- new( "JointDistribution", margins = margins, dependence = copula_gaussian(~ cell_type) ) ``` --- ### Extensibility For the marginals, we can borrow from existing LSS packages (Rigby and Stasinopoulos, 2005; Hofner, Mayr, and Schmid, 2014) to implement gene-wise regressions, * Gaussian, Gamma, Poisson, Binomial, Negative-Binomial, Zero-Inflated Poisson, Zero Inflated Negative Binomial,… We can also build a unified interface to various copula estimation routines, * Vine copulas for capturing higher-order moments. * Gaussian and Student-t copulas with standard sample covariance, shrunken covariance, or graphical lasso precision estimates. * `copula_vine(), copula_adaptive(), copula_glasso(), ...` --- ### Verbs: <span style="color: #D94E4E">Plot</span> ```r plot(simulator, "pseudotime", line_opts = line_opts, ribbon_opts = ribbon_opts, point_opts = point_opts) + facet_wrap(~ feature, ncol = 4) ``` <img src="20240411_files/figure-html/example_plot-1.png" width="800" style="display: block; margin: auto;" /> --- ### Verbs: <span style="color:#025E73">Mutate</span> `mutate` lets you modify a few elements from a larger simulator. ```r altered <- mutate(margins, Chga:Ins1, link = ~ pseudotime) |> estimate(sce) ``` <img src="20240411_files/figure-html/plot_alteration-1.png" width="800" style="display: block; margin: auto;" /> --- ### Verbs: <span style="color:#025E73">Mutate</span> We can layer several changes on top of one another. Syntax color highlights modifications that haven’t been re-fit. ```r margins |> mutate(Chga:Ins1, link = ~ pseudotime) |> mutate(c("Pyy", "Rbp4"), family = ~ GaussianLSS()) |> mutate(Iapp, link = list(mu = ~ ns(pseudotime, df = 5), sigma = ~ 1, nu = ~ 1)) ``` <pre class="r-output"><code>## <span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>Pyy</span> Gaussian [mu,sigma] ~ns(pseudotime, 3) ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>Iapp</span> ZINBI [mu,sigma,nu] mu~ns(pseudotime, df = ... ## <span style='color: #555555;'>3</span> Chgb ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>Rbp4</span> Gaussian [mu,sigma] ~ns(pseudotime, 3) ## <span style='color: #555555;'>5</span> Spp1 ZINBI [mu,sigma,nu] ~ns(pseudotime, 3) ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>Chga</span> ZINBI [mu,sigma,nu] ~pseudotime ## <span style='color: #00BBBB;'>Pyy, Iapp, Rbp4</span>, and <span style='color: #00BBBB;'>3</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 3 × 2</span> ## feature fit ## <span style='color: #555555; font-style: italic;'><chr></span> <span style='color: #555555; font-style: italic;'><list></span> ## <span style='color: #555555;'>1</span> Pyy <span style='color: #555555;'><glmbsLSS></span> ## <span style='color: #555555;'>2</span> Iapp <span style='color: #555555;'><glmbsLSS></span> ## <span style='color: #555555;'>3</span> Chgb <span style='color: #555555;'><glmbsLSS></span> ## ... and 7 additional features. </code></pre> --- ### Verbs: <span style="color:#025E73">Mutate</span> 1. Here is a more realistic example from a longitudinal microbiome study. 2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes. .pull-three-quarters-left[ <img src="figures/nulls_unaltered.png"/> ] .pull-three-quarters-right[ <img src="figures/pairwise_cors.png"/> ] --- ### Verbs: <span style="color:#025E73">Mutate</span> 1. Here is a more realistic example from a longitudinal microbiome study. 2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes. .pull-three-quarters-left[ <img src="figures/altered_ns.png"/> ] .pull-three-quarters-right[ <img src="figures/pairwise_cors_altered.png"/> ] --- ### Verbs: <span style="color:#025E73">Join</span> We should make it possible to combine simulators like Lego blocks. ```r experiments <- list(methylation = SCGEMMETH_sce, rna = SCGEMRNA_sce) families <- list(~ BI(), ~ GaussianLSS()) sims <- experiments |> map2(families, \(x, y) setup_simulator(x, ~ cell_type, y)) ``` <img src="figures/simulator_join_motivation.png"/> --- ### Verbs: <span style="color:#025E73">Join</span> (Copula) One approach is to merge the list of marginal distributions and re-estimate the joint distribution. ```r sim_joined <- map(sims, estimate, nu = 0.1) |> join_copula(copula_glasso()) ``` This assumes that we have samples where all features are measured. .center[ <img src="figures/copula_join.png" width = 680/> ] --- ### Verbs: <span style="color:#025E73">Join</span> (Conditioning) Alternatively, we can combine two simulators by conditioning them on shared latent structure. ```r sim_joined <- join_pamona(sims) ``` <pre class="r-output"><code>## Pamona start! ## use random seed: 1 ## Pamona Done! takes 4.380018 seconds </code></pre> .center[ <img src="figures/simulator_join.png" width = 800/> ] --- ### Verbs: <span style="color:#025E73">Join</span> (Conditioning) This used partial manifold alignment (Cao, Hong, and Wan, 2020) to learn shared latent variables across assays and works even in the diagonal integration setting. ```r sim_joined ``` <pre class="r-output"><code>## $methylation ## <span style='color: #555555;'>[Marginals] ## </span><span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>AK123759</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>ADAM33</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>3</span> <span style='color: #00BBBB;'>NFIX</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>FOXD2</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>5</span> <span style='color: #00BBBB;'>HLX.AS1</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>LY86.AS1</span> BI [mu] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #00BBBB;'>AK123759, ADAM33, NFIX</span>, and <span style='color: #00BBBB;'>24</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 0 × 0</span> ## <span style='color: #555555;'> ## [Dependence] ## </span>0 NULLs with features ## <span style='color: #555555;'> ## [Template Data] ## </span>class: SingleCellExperiment ## dim: 27 142 ## metadata(0): ## assays(1): counts ## rownames(27): AK123759 ADAM33 ... KCNQ2 CDH22 ## rowData names(0): ## colnames(142): CellMeth1 CellMeth2 ... CellMeth141 CellMeth142 ## colData names(10): X1 X2 ... UMAP1 UMAP2 ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): ## ## $rna ## <span style='color: #555555;'>[Marginals] ## </span><span style='color: #BB00BB;'>Plan: ## </span><span style='color: #555555;'># A tibble: 6 × 3</span> ## feature family link ## <span style='color: #555555; font-style: italic;'><gene_id></span> <span style='color: #555555; font-style: italic;'><distn></span> <span style='color: #555555; font-style: italic;'><link></span> ## <span style='color: #555555;'>1</span> <span style='color: #00BBBB;'>ZIC3</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>2</span> <span style='color: #00BBBB;'>KCNQ2</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>3</span> <span style='color: #00BBBB;'>ZFP42</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>4</span> <span style='color: #00BBBB;'>OTX2</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>5</span> <span style='color: #00BBBB;'>SALL4</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #555555;'>6</span> <span style='color: #00BBBB;'>NANOG</span> Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2 ## <span style='color: #00BBBB;'>ZIC3, KCNQ2, ZFP42</span>, and <span style='color: #00BBBB;'>29</span> <span style='color: #00BBBB;'>other features</span> need fitting.<span style='color: #BB00BB;'> ## Estimates: ## </span><span style='color: #555555;'># A tibble: 0 × 0</span> ## <span style='color: #555555;'> ## [Dependence] ## </span>0 NULLs with features ## <span style='color: #555555;'> ## [Template Data] ## </span>class: SingleCellExperiment ## dim: 32 177 ## metadata(0): ## assays(1): logcounts ## rownames(32): ZIC3 KCNQ2 ... NESTIN MYC ## rowData names(0): ## colnames(177): CellRNA1 CellRNA2 ... CellRNA176 CellRNA177 ## colData names(10): X1 X2 ... UMAP1 UMAP2 ## reducedDimNames(0): ## mainExpName: NULL ## altExpNames(0): </code></pre> --- .section[ ## Interactive Demo ] <br/> <br/> <br/> <div style="font-size: 34px;"> Live Code: <a href="https://go.wisc.edu/9n0w7t/">https://go.wisc.edu/9n0w7t/</a> <br/> Colab Notebook: <a href="https://go.wisc.edu/u5a94m">https://go.wisc.edu/u5a94m</a> <br/> Compiled Notebook: <a href="https://go.wisc.edu/62eh36"> https://go.wisc.edu/62eh36</a> </div> --- ### Example 2: Power Analysis With a given sample size and experimental design, what types of signals will be recoverable? * This prevents wasting effort on underpowered studies. * In practice, researchers often report power for tests with analytical power curves (e.g., `\(t\)`-tests), even if that is not the planned analysis. Simulation can guide power analysis that reflects the planned analysis approach. --- ### Problem Context (Gavin, Mullaney, Loo, Cao, Gottlieb, Hill, Zipris, and Hamilton-Williams, 2018) studied the metaproteomics of Type I Diabetes (T1D). .pull-left[ * They sampled participants at several disease stages (33 with T1D, 17 at high risk, 29 at low risk, 22 healthy) * Their analysis used sPLS-DA (Lê Cao, Boitard, and Besse, 2011), a variant of linear discriminant analysis. ] .pull-right[ <img src="figures/gavin_t1d_summary.png" width=500/> ] They found that intestinal and pancreas function are altered even before the onset of full T1D. --- ### Multivariate Power Analysis What would be a good power analysis for a follow-up study? 1. Estimation: Learn a simulator from the 101 available samples. 2. Perturbation: Generate data with varying sample sizes and signals. 3. Evaluation: Compute the average sPLS-DA performance across replicates. .center[ <img src="figures/power_overview.png" width=700/> ] --- ### Estimation 1. We fit Gaussian LSS models to log-ratio transformed measurements of the 427 proteins found in `\(\geq\)` 70% of subjects. 2. Means and variances are allowed to vary across low and high risk groups. 3. Main difficulty: The real data have long left tails. .center[ <img src="figures/t1D-multivariate-0.png", width=800/> ] --- ### Perturbation .pull-left[ * We `mutate` the simulators so that the true number of T1D-associated proteins is `\(s \in \{10, 30, \dots, 90\}\)` * We choose the true non-nulls from among proteins with with observed strong effects - 32 are found significant at `\(\alpha = 0.05\)` in the original data ] .pull-right[ <img src="figures/p_values.png"/> ] --- ### Evaluation * The AUC for tuned sPLS-DA models varies from 0.5 (random guess) to over 0.9 in the high-signal regime. * Potential Extension: Model patient heterogeneity (age, finer risk levels). .center[ <img src="figures/t1D-multivariate-2.png" width=550/> ] --- ### <span style="color:#D94E4E">Interactive Plot</span> We have an experimental interactive visualization that can be used for model criticism. You can check out the code: https://go.wisc.edu/q43ol3 .center[ <a href="http://localhost:5000"> <img src="figures/scdesigner-interactive.png" width=700/> </a> ] --- ### Future Directions: Improving Generators .pull-left[ GAMLSS models are simple and interpretable. However, * They can oversmooth sharp transitions. * Fitting each feature separately is both statistically and computationally inefficient. ] .pull-right[ <img src="figures/spatial_example_smoothness.png"/> ] --- ### Future Directions: Quantitative Evaluation .pull-left[ We need formal ways for gauging the quality of a simulator. * Existing evaluation criteria for simulated omics data rely on ad-hoc summaries. * GAN precision-recall measures can offer a more systematic approach (Sajjadi, Bachem, Lucic, Bousquet, and Gelly, 2018). ] .pull-right[ <img src="figures/gan-eval2.png"/> ] --- ### Future Directions: Quantitative Evaluation .pull-left[ We need formal ways for gauging the quality of a simulator. * Existing evaluation criteria for simulated omics data rely on ad-hoc summaries. * GAN precision-recall measures can offer a more systematic approach (Sajjadi et al., 2018). ] .pull-right[ <img src="figures/gan-eval1.png"/> ] --- ### Future Directions: Autocompletion * Many parameters can influence simulation quality: Preprocessing, distributional assumptions, covariates to include, spline basis * Can we "recommend" the next step of simulator construction, based on initial user specification (Heer, 2019)? .center[ <img src="figures/wrangler.png" width=500/> ] --- ### History (Scott, Shane, and Swanson, 1954) is the earliest example I've found that uses computer simulated data to guide data analysis. .pull-left[ <img src="figures/galaxies-real.png" width=400/> ] .pull-right[ <img src="figures/galaxies-synthetic.png" width=400/> ] --- ### Thank you! * Lab Members: Margaret Thairu, Hanying Jiang, Shuchen Yan, Mason Garza, Yuliang Peng, Kaiyan Ma, Kai Cui, and Kobe Uko * Discussions with Jingyi Jessica Li (UCLA) and Andres Colubri (UMass) * Funding: NIGMS R01GM152744. --- ### References [1] D. Lahnemann, J. Koster, E. Szczurek, et al. "Eleven grand challenges in single-cell data science". In: _Genome Biology_ 21 (2020). <https://api.semanticscholar.org/CorpusID:211054289>. [2] L. Zappia, B. Phipson, and A. Oshlack. "Splatter: simulation of single-cell RNA sequencing data". In: _Genome Biology_ 18 (2017). <https://api.semanticscholar.org/CorpusID:9332625>. [3] D. Song, Q. Wang, G. Yan, et al. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". In: _Nature Biotechnology_ (2023), pp. 1-6. <https://api.semanticscholar.org/CorpusID:258638565>. --- ### References [4] R. A. Rigby and D. M. Stasinopoulos. "Generalized additive models for location, scale and shape". In: _Journal of the Royal Statistical Society: Series C (Applied Statistics)_ 54 (2005). <https://api.semanticscholar.org/CorpusID:56076930>. [5] B. Hofner, A. Mayr, and M. Schmid. "gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework". In: _arXiv: Computation_ (2014). <https://api.semanticscholar.org/CorpusID:62302678>. [6] K. Cao, Y. Hong, and L. Wan. "Manifold alignment for heterogeneous single-cell multi-omics data integration using Pamona". In: _Bioinformatics_ 38 (2020), pp. 211 - 219. <https://api.semanticscholar.org/CorpusID:226959190>. --- ### References [7] P. G. Gavin, J. A. Mullaney, D. Loo, et al. "Intestinal Metaproteomics Reveals Host-Microbiota Interactions in Subjects at Risk for Type 1 Diabetes". In: _Diabetes Care_ 41.10 (Aug. 2018), p. 2178–2186. ISSN: 1935-5548. DOI: 10.2337/dc18-0777. <http://dx.doi.org/10.2337/dc18-0777>. [8] B. J. Callahan, K. Sankaran, J. Fukuyama, et al. "Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses". In: _F1000Research_ 5 (2016). <https://api.semanticscholar.org/CorpusID:6468345>. [9] G. H. Golub and H. Zha. "Perturbation analysis of the canonical correlations of matrix pairs". In: _Linear Algebra and its Applications_ 210 (1994), pp. 3-28. <https://api.semanticscholar.org/CorpusID:123160725>. --- ### References [10] K. Arikawa. "Theoretical framework for analyzing structural compliance properties of proteins". In: _Biophysics and Physicobiology_ 15 (2018), pp. 58 - 74. <https://api.semanticscholar.org/CorpusID:4472297>. [11] K. Lê Cao, S. Boitard, and P. Besse. "Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems". In: _BMC Bioinformatics_ 12.1 (Jun. 2011). ISSN: 1471-2105. DOI: 10.1186/1471-2105-12-253. <http://dx.doi.org/10.1186/1471-2105-12-253>. [12] M. S. M. Sajjadi, O. Bachem, M. Lucic, et al. "Assessing Generative Models via Precision and Recall". In: _ArXiv_ abs/1806.00035 (2018). <https://api.semanticscholar.org/CorpusID:44104089>. --- ### References [13] J. Heer. "Agency plus automation: Designing artificial intelligence into interactive systems". In: _Proceedings of the National Academy of Sciences_ 116.6 (Feb. 2019), p. 1844–1850. ISSN: 1091-6490. DOI: 10.1073/pnas.1807184115. <http://dx.doi.org/10.1073/pnas.1807184115>. [14] E. L. Scott, C. D. Shane, and M. D. Swanson. "Comparison of the Synthetic and Actual Distribution of Galaxies on a Photographic Plate." In: _The Astrophysical Journal_ 119 (Jan. 1954), p. 91. ISSN: 1538-4357. DOI: 10.1086/145799. <http://dx.doi.org/10.1086/145799>. --- ### Verbs: Plot ```r sample(simulator) |> plot_correlations(points = point_opts) ``` <img src="20240411_files/figure-html/correlation_plot-1.png" width="600" style="display: block; margin: auto;" /> --- ### Verbs: <span style="color:#D94E4E">Mutate</span> There are many reasons we might want to alter an initial simulator, * **Synthetic Nulls**: We can define negative controls to safeguard against false discoveries. * **Iteration**: We may want to improve the simulator’s goodness of fit by modifying the regression. * **Power Analysis**: We may want to construct data from alternative experimental designs or biological signal strengths. --- ### <span style="color:#025E73">Modularity</span> - Augment Another way to alter a simulator is to make implicit data explicitly available, in the spirit of the broom package's `augment`. For example, these data appear to have mouse-level effects. .center[ <img src="figures/subject_effect_real.png" width=750/> ] --- ### <span style="color:#025E73">Modularity</span> - Augment 1. Unfortunately, none of the packages we use at the marginal estimation step support random effects. 1. We defined an `augment` function that defines a simple proxy, * Compute sample-wise low-dimensional multidimensional scaling coordinates `\(z_{i}\)`. * For mouse `\(s\)`, condition on its average coordinates: `\(F_{g}\left(y_{ig} \vert x_{i}, \bar{z}_{s\left(i\right)}\right)\)` * Use with caution: The `\(z_{i}\)` will be correlated with the diet effect. .large-code[ ```r exper <- augment_mds(exper, K = 4) |> groupwise_average(starts_with("MDS"), MouseID) ``` ] --- ### <span style="color:#025E73">Modularity</span> - Augment * Compute sample-wise low-dimensional multidimensional scaling coordinates `\(z_{i}\)`. * For mouse `\(s\)`, condition on its average coordinates: `\(F_{g}\left(y_{ig} \vert x_{i}, \bar{z}_{s\left(i\right)}\right)\)` * Use with caution: The `\(z_{i}\)` will be correlated with the diet effect. .center[ <img src="figures/mouse_effect_comparison.png" width=1000/> ] --- ### Example: Multi-omics Power Analysis 1. In practice, most power analyses are univariate, e.g., to guide differential expression. 2. How should we do multivariate power analysis, especially with several tables? --- ### Example Let’s consider the microbiome + metabolome mouse study from (Callahan, Sankaran, Fukuyama, McMurdie, and Holmes, 2016). This had 12 samples. 1. How does estimation quality improve with sample size? 2. Is it worth gathering unpaired samples? (mosaic designs) .center[ <img src="figures/cca_f1000.gif" width=650/> ] --- ### Example: Estimate Simulators 1. Fit ZINB models to the microbiome. 2. Fit Gaussian models to the metabolome. 3. Join them using highly regularized covariance estimates. .center[ <img src="figures/cca_microbiome_hist.png"/> ] --- ### Part 1: Estimate Simulators 1. Fit ZINB models to the microbiome. 2. Fit Gaussian models to the metabolome. 3. Join them using highly regularized covariance estimates. .center[ <img src="figures/cca_metab_hist.png"/> ] --- ### Part 2: Hypothetical Experiments .pull-left[ 1. Across sample sizes `\(n\)`, we simulate new data, fit sparse CCA, and identify a basis `\(\hat{V}_{n}\)` spanning the `\(K = 5\)` left and right sparse CCA factors 1. We use `\(\hat{V}_{5000}\)` as a reference and compute canonical angles (Golub and Zha, 1994) with all smaller sample sizes. ] .pull-right[ <img src="figures/canonical_angles_paper.png"/> Figure from (Arikawa, 2018). ] --- ### Part 2: Hypothetical Experiments With more samples, the canonical angles shrink predictably. .center[ <img src="figures/canonical_angles.png" width=750/> ] --- ### Mosaic Case Here, we have allowed the sample sizes to differ and use KNN imputation to run CCA on an imputed table. .center[ <img src="figures/mosaic_example.png" width=750/> ] --- ### Aside: Negative Control The default sPLS-DA visualization output can separate classes even when no signals exist. This is a consequence of overfitting. .center[ <img src="figures/t1D-null-overfit.png" width=400/> ] ---