class: title

<div id="title">
Semisynthetic Simulation for Biological Data Analysis
</div>
<div id="under_title">
Session 2: Multivariate Analysis
</div>

<div id="subtitle">
Kris Sankaran <br/>
18 | June | 2024 <br/>
Lab: <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/>
</div>

<div id="subtitle_right">
Melbourne Integrative Genomics<br/>
Slides: <a href="https://go.wisc.edu/rc776i">go.wisc.edu/rc776i</a><br/>
Code: <a href="https://go.wisc.edu/o5sn6w">go.wisc.edu/o5sn6w</a>
</div>

---

### Today's Learning Outcomes

By the end of this session, you will be able to...

1. Identify situations in your own research where multivariate methods can be used and simulation can help.
1. Discuss how latent variable and copula models induce correlation across collections of features.
1. Design and implement power and benchmarking analyses for multivariate models.

---

class: middle

.center[
## Scientific Context
]

---

### Multivariate Modeling

* Features in high-throughput biological data are strongly correlated. Focusing on marginals alone would keep us from understanding important relationships.
* Multivariate thinking can inform higher-level conceptual categories: regulatory modules, microbial communities, cell types, ...

.center[
<img src="figure/eisen.png" width=650/> <br/>
<span style="font-size: 20px">
A heatmap of microarray data from [1]
</span>
]

---

### Example Analyses

These methods are often based on some form of matrix factorization, and they are ubiquitous in modern genomics:

* Clustering: Cancer subtypes, cell atlases [2; 3].
* Dimensionality Reduction: Microbiome communities [4; 5], pseudotime analysis [6; 7].
* Network Analysis: Gene regulatory networks [8; 9].

<span style="font-size: 20px">
Example Methods: Multidimensional Scaling, Principal Components Analysis, Uniform Manifold Approximation and Projection, Nonnegative Matrix Factorization, Mixed-Membership Modeling, Stochastic Blockmodel, ...
</span>

---

### Role of Simulation

Simulation can be valuable even in this multivariate setting.

* **Power Analysis**: How does the likely number of discoveries vary as we vary the sample size and experimental design?
* **Benchmarking**: For a given multivariate analysis output, what are the tradeoffs between candidate methods?
* **Calibration**: Can conclusions be adjusted to ensure that target false discovery rates are closely met on synthetic data?

---

### Discussion

Please discuss in groups of 2 - 4:

* In your research, where can you use multivariate methods?
* How might simulation assist that analysis?

We will debrief responses as a group.

---

class: middle

.center[
## Statistical Background
]

---

### Latent Variables

This approach identifies low-dimensional profiles that underlie all observed variation.

.center[
<img src="figure/latent_approaches.png" width=400/>
]

---

### Matrix Factorization

This can be accomplished using a form of matrix factorization [10; 11]:

`\begin{align*}
x_{i} \approx A z_{i}
\end{align*}`

* In clustering, `\(z_{i}\)` encodes which cluster sample `\(i\)` belongs to.
* In PCA, `\(z_i\)` are scores with respect to components `\(A\)`.

.center[
<img src="figure/matrix_factorization.png" width=550/>
]

---

### Matrix Factorization

This can be accomplished using a form of matrix factorization [10; 11]:

`\begin{align*}
x_{i} \approx A z_{i}
\end{align*}`

* In clustering, `\(z_{i}\)` encodes which cluster sample `\(i\)` belongs to.
* In PCA, `\(z_i\)` are scores with respect to components `\(A\)`.

.center[
<img src="figure/clustering.png" width=550/>
]

---

### Matrix Factorization

This can be accomplished using a form of matrix factorization [10; 11]:

`\begin{align*}
x_{i} \approx A z_{i}
\end{align*}`

* In clustering, `\(z_{i}\)` encodes which cluster sample `\(i\)` belongs to.
* In PCA, `\(z_i\)` are scores with respect to components `\(A\)`.

.center[
<img src="figure/factorization.png" width=550/>
]

---

### Implications

This suggests a simulation strategy (see the sketch on the next slide):

.pull-left[
* Learn `\(\mathbf{z}_{i}\)` from real data.
* For each `\(d\)`, learn the marginal `\(F_{d}\left(x_{id} \vert \mathbf{z}_{i}\right)\)` using GAMLSS.
* To generate new `\(\mathbf{x}^{\ast}\)`, first simulate `\(\mathbf{z}^\ast\)` and sample from `\(\left[F_{1}\left(\cdot \vert \mathbf{z}^\ast\right), \dots, F_{D}\left(\cdot \vert \mathbf{z}^\ast\right)\right]\)`.
]

.pull-right[
<img src="figure/clustering_pushforward.png" width=500/>
]
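---

### Implications: A Sketch

A minimal sketch of this strategy in base R, standing in k-means for the latent `\(\mathbf{z}_{i}\)` and a per-cluster Gaussian for each marginal in place of a full GAMLSS fit; the matrix `x` here is hypothetical.

``` r
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)  # hypothetical samples x features matrix

# learn z_i: here, cluster memberships from k-means
K <- 3
z <- kmeans(x, centers = K)$cluster

# learn each marginal F_d(. | z): here, just a mean and sd per cluster
params <- lapply(seq_len(ncol(x)), function(d) {
  t(sapply(seq_len(K), function(k) {
    c(mean = mean(x[z == k, d]), sd = sd(x[z == k, d]))
  }))
})

# simulate z* from the cluster proportions, then sample each margin given z*
n_star <- 50
z_star <- sample(seq_len(K), n_star, replace = TRUE, prob = table(z) / length(z))
x_star <- sapply(seq_along(params), function(d) {
  rnorm(n_star, params[[d]][z_star, "mean"], params[[d]][z_star, "sd"])
})
```

Swapping in richer marginals (e.g., GAMLSS fits) changes only the second and third steps.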
---

### Copula Models

These are a type of model that "couple" a collection of known marginal distributions [12; 13; 14].

.center[
<img src="figure/gene-gene_dependence.png" width=900/>
]

---

### Starting Point

**Question**: If we were asked to simulate a vector of five correlated variables on our computers right now, what would be the easiest thing to do?

---

### Starting Point

**Question**: If we were asked to simulate a vector of five correlated variables on our computers right now, what would be the easiest thing to do?

``` r
library(mvtnorm)

D <- 5
ones <- rep(1, D)
Sigma <- 0.01 * diag(D) + 0.99 * ones %*% t(ones)
rmvnorm(3, rep(0, D), Sigma)
```

```
##              [,1]       [,2]      [,3]      [,4]       [,5]
## [1,] 0.0009511073 0.09897214 0.2344859 0.2382537 0.08975082
## [2,] 1.2676617829 1.30234917 1.2477393 1.3278895 1.29604245
## [3,] 2.2345516427 2.23787330 2.3443253 2.1891693 2.39399368
```

The difficulty is that we usually want non-Gaussian margins `\(F_{1}, \dots, F_{D}\)`.

---

### Intuition

* In the Gaussianized space, it's easy to model correlation.
* The mapping back and forth is possible because we know the margins `\(F\)`.
  - `\(\Phi\)` represents the Gaussian CDF applied componentwise

<br/>
<br/>

.center[
<img src="figure/copula_transformation.png" width=700/>
]

---

### Copula Models

More formally, let `\(F_{1}, \dots, F_{D}\)` be the target margins and let `\(\Phi\)` be the CDF of the standard Gaussian distribution. Gaussian copula modeling has these steps.

Estimate:

1. Gaussianize the observed `\(\mathbf{x}_{i}\)` to `\(\mathbf{z}_{i} := \left[\Phi^{-1}\left(F_{1}\left(x_{i1}\right)\right), \dots, \Phi^{-1}\left(F_{D}\left(x_{iD}\right)\right)\right]\)`
1. Estimate the covariance `\(\hat{\Sigma}\)` associated with the `\(\mathbf{z}_{i}\)`

Simulate:

1. Draw `\(\mathbf{z}^\ast \sim \mathcal{N}\left(0, \hat{\Sigma}\right)\)`
1. Transform back `\(\mathbf{x}^\ast := \left[F_{1}^{-1}\left(\Phi\left(z_{1}^\ast\right)\right), \dots, F_{D}^{-1}\left(\Phi\left(z_{D}^\ast\right)\right)\right]\)`

---

### Variations

1. We might expect the correlation structure to vary across groups. This can be accomplished by setting separate `\(\Sigma_{k}\)` across groups `\(k\)`.
1. In high dimensions, the sample covariance `\(\hat{\Sigma}\)` can be unstable. In this case, we should use high-dimensional covariance estimators [15; 16].

.center[
<img src="figure/copula_groups.png" width=700/>
]
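---

### Copula Models: A Sketch

A from-scratch sketch of the estimate and simulate steps above, assuming exponential margins purely for illustration; the matrix `x` is hypothetical.

``` r
library(mvtnorm)
set.seed(1)
x <- matrix(rexp(200 * 5), 200, 5)  # hypothetical samples x features matrix

# Estimate: fit each margin, Gaussianize each column, take the covariance
rates <- 1 / colMeans(x)
z <- sapply(seq_len(ncol(x)), function(d) qnorm(pexp(x[, d], rates[d])))
Sigma_hat <- cov(z)

# Simulate: draw z* ~ N(0, Sigma_hat), then map back through each F_d^{-1}
z_star <- rmvnorm(100, sigma = Sigma_hat)
x_star <- sapply(seq_len(ncol(x)), function(d) qexp(pnorm(z_star[, d]), rates[d]))
```

The `copula_*` functions discussed next wrap this same logic with more flexible margins and covariance estimators.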
---

### Definition

Last session's code actually estimated Gaussian copulas in the background by default.

``` r
setup_simulator(
  exper,
  ~ group,
  ~ GaussianLSS(),
  copula_gaussian()
)
```

---

### Conditioned Copulas

We can allow the covariance to depend on group membership using the same formula syntax we used for the marginals.

``` r
setup_simulator(
  exper,
  ~ group,
  ~ GaussianLSS(),
  copula_gaussian(~ group)
)
```

---

### Available Copulas

* `copula_adaptive`: Adaptive thresholding for high-dimensional covariance estimation [16].
* `copula_vine`: Vine copula for matching higher-order moments [17].
* `copula_glasso`: Graphical lasso for sparse covariance estimation [15].
* `copula_t`: Student's `\(t\)` copula for better modeling of tail dependence.
* `copula_*_t`: Modified covariance estimators applied to the `\(t\)` copula.

---

### sPLS-DA Setting

Our power analysis uses Sparse Partial Least Squares Discriminant Analysis (sPLS-DA) [18; 19; 20]. This topic is its own full workshop, but let's review the core ideas.

.pull-left[
sPLS-DA helps with prediction when:

* s: Not all features are predictive
* PLS: Many features are correlated with one another
* DA: The response is one of `\(K\)` classes
]

.pull-right[
<img src="figure/PLS-setup.png" width=500/>
]

---

### sPLS-DA Intuition

We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar.

.center[
<img src="figure/PLS-step1.png" width=500/>
]

Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`.

---

### sPLS-DA Intuition

We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar.

.center[
<img src="figure/PLS-step2.png" width=240/>
]

Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`.

---

### sPLS-DA Intuition

We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar.

.center[
<img src="figure/PLS-step3.png" width=400/>
]

Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`.

---

### sPLS-DA Intuition

We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar.

.center[
<img src="figure/PLS-step4.png" width=150/>
]

Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize `\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`.

---

### sPLS-DA Intuition

Now we can compare samples from the two tables in a single, shared space.

.center[
<img src="figure/PLS-step5.png" width=800/>
]

---

### sPLS-DA Intuition

Now we can compare samples from the two tables in a single, shared space.

.center[
<img src="figure/PLS-step6.png" width=800/>
]

---

### sPLS-DA Intuition

To get more than one dimension, we can repeat this process after removing any correlation with previously found patterns.

.center[
<img src="figure/PLS-step7.png" width=800/>
]

---

### sPLS-DA: Outputs

The blending coefficients tell you how to interpret axes in this space. For example, `\(a^{(1)}_{d} = 0\)` means feature `\(d\)` wasn't used for the first dimension.

.pull-left[
<img src="20240618_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

.pull-right[
<img src="20240618_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
]

Example sPLS-DA output from the `mixOmics` package [20].
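---

### sPLS-DA in Code

A short usage sketch with `mixOmics` [20], using the `srbct` example data that ships with the package; treat the exact calls as a sketch rather than a full analysis.

``` r
library(mixOmics)
data(srbct)  # gene expression (srbct$gene) and tumor subtypes (srbct$class)

# keep 50 features per dimension, across two dimensions
result <- splsda(srbct$gene, srbct$class, ncomp = 2, keepX = c(50, 50))

plotIndiv(result)            # samples in the shared, low-dimensional space
selectVar(result, comp = 1)  # features with nonzero weights on dimension 1
```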
---

class: middle

.center[
## Demo + Exercises
]

* Code repository: [go.wisc.edu/o5sn6w](https://go.wisc.edu/o5sn6w)
* Exercise solutions: [go.wisc.edu/441a60](https://go.wisc.edu/441a60)
* Live Link: [go.wisc.edu/mis697](https://go.wisc.edu/mis697)

---

### Summary

1. Multivariate analysis is essential for reasoning about higher-level concepts in biological data.
1. Copula and factor models are general approaches to inducing correlation in simulated data.
1. The `copula_*` functions in `scDesigner` give a few approaches to copula-based simulation.
1. Power analysis can be tightly coupled with multivariate algorithms.

---

### Next Time

Marginal `\(\to\)` Joint `\(\to\)` <span style="color:#8C1F33">Integrative</span>

.center[
<img src="figure/integration_types_3.png" width=600/>
]

---

### References

[1] M. B. Eisen, P. T. Spellman, P. O. Brown, et al. "Cluster analysis and display of genome-wide expression patterns". En. In: _Proc. Natl. Acad. Sci. U. S. A._ 95.25 (Dec. 1998), pp. 14863-14868.

[2] S. R. Quake. "A decade of molecular cell atlases". En. In: _Trends Genet._ 38.8 (Aug. 2022), pp. 805-810.

[3] M. D. Luecken, M. Büttner, K. Chaichoompu, et al. "Benchmarking atlas-level data integration in single-cell genomics". En. In: _Nat. Methods_ 19.1 (Jan. 2022), pp. 41-50.

[4] K. Sankaran and S. P. Holmes. "Latent variable modeling for the microbiome". En. In: _Biostatistics_ 20.4 (Oct. 2019), pp. 599-614.

---

[5] R. A. Deek and H. Li. "A zero-inflated Latent Dirichlet Allocation model for microbiome studies". En. In: _Front. Genet._ 11 (2020), p. 602594.

[6] K. Street, D. Risso, R. B. Fletcher, et al. "Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics". En. In: _BMC Genomics_ 19.1 (Dec. 2018).

[7] C. Xia, J. Fan, G. Emanuel, et al. "Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression". En. In: _Proc. Natl. Acad. Sci. U. S. A._ 116.39 (Sep. 2019), pp. 19490-19499.

[8] S. Wu, A. Joseph, A. S. Hammonds, et al. "Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks". En. In: _Proc. Natl. Acad. Sci. U. S. A._ 113.16 (Apr. 2016), pp. 4290-4295.

---

[9] Z. Duren, X. Chen, M. Zamanighomi, et al. "Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations". En. In: _Proc. Natl. Acad. Sci. U. S. A._ 115.30 (Jul. 2018), pp. 7723-7728.

[10] G. L. Stein-O'Brien, R. Arora, A. C. Culhane, et al. "Enter the matrix: Factorization uncovers knowledge from omics". En. In: _Trends Genet._ 34.10 (Oct. 2018), pp. 790-805.

[11] K. P. Murphy. _Probabilistic machine learning_. En. London, England: MIT Press, Aug. 2023.

[12] H. Joe. _Dependence modeling with copulas_. En. Chapman & Hall/CRC Monographs on Statistics and Applied Probability. London, England: CRC Press, Jan. 2023.

---

[13] R. A. Deek and H. Li. "Inference of microbial covariation networks using copula models with mixture margins". En. In: _Bioinformatics_ 39.7 (Jul. 2023).

[14] T. Sun, D. Song, W. V. Li, et al. "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured". En. In: _Genome Biol._ 22.1 (May. 2021), p. 163.

[15] J. Friedman, T. Hastie, and R. Tibshirani. "Sparse inverse covariance estimation with the graphical lasso". En. In: _Biostatistics_ 9.3 (Jul. 2008), pp. 432-441.

[16] T. Cai and W. Liu. "Adaptive Thresholding for Sparse Covariance Matrix Estimation".
In: _J. Am. Stat. Assoc._ 106.494 (Jun. 2011), pp. 672-684.

[17] C. Czado and T. Nagler. "Vine copula based modeling". En. In: _Annu. Rev. Stat. Appl._ 9.1 (Mar. 2022), pp. 453-477.

[18] K. Lê Cao, D. Rossouw, C. Robert-Granié, et al. "A sparse PLS for variable selection when integrating omics data". En. In: _Stat. Appl. Genet. Mol. Biol._ 7.1 (Nov. 2008), p. Article 35.

[19] K. Lê Cao, S. Boitard, and P. Besse. "Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems". En. In: _BMC Bioinformatics_ 12.1 (Jun. 2011), p. 253.

[20] F. Rohart, B. Gautier, A. Singh, et al. "mixOmics: An R package for 'omics feature selection and multiple data integration". En. In: _PLoS Comput. Biol._ 13.11 (Nov. 2017), p. e1005752.