# Enhancing Microbiome Analysis with Semisynthetic Data

<div id="subtitle">
Kris Sankaran <br/>
Plant Pathology Seminar<br/>
18 | February | 2025 <br/>
</div>
<div id="subtitle_right">
Slides: <a href="https://go.wisc.edu/689h7c">go.wisc.edu/689h7c</a><br/>
Lab: <a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/>
</div>

---

### Microbiome Data

1. A microbiome is a microbe-scale ecosystem.  It can be described by taxonomic
composition, genomic function, and biochemical features.

1. Advances in sequencing technology have made it easier than ever to rapidly
profile these taxonomic and genomic features in a range of sites, including in
the human gut, on plant roots, and in the oceans.

.center[
<span style="font-size: 18px">
<img src="figures/spatial_earle.png" width=455/><br/>
Spatial imaging of a microbial community along the gut lining, from [1].
</span>
]
---

### Statistical Challenges

Developing the data analysis for a microbiome study can be complicated by a
number of factors.

* **Integration**: How should we transform and analyze data across
batches or technologies, each with unique sources of technical variability?

* **Experimental Design**: How should we arrange sampling, assign treatments,
and place controls so that we can draw statistically powerful conclusions?

* **Reproducibility**: How can we be sure our conclusions are trustworthy?

---

### Data Analysis Controversy

In June 2024, _Nature_ retracted a paper [2] that claimed to
identify microbiome signatures of cancer. This came after a year of
debate [3; 4] about the data analysis.

---

### Data Analysis Controversy

The "disease signature" was an artifact resulting from the use of a batch effect
correction method.  Before we can understand the nuances of the story, we need
to learn about batch effects and correction methods.

---

### Simulation to Resolve Controversy

Gerry Tonkin-Hill has an excellent re-analysis [5] of the data from [2], which
sheds light on the likely source of the phantom signals. The first part is a
simulation.

---

### The Changing Simulation Landscape

Historically, microbiome researchers have only rarely used simulation in
their data analysis workflow.

.pull-left[
  <img src="figures/noun-frustration-7442748.png" width=70/> The simulators would have to be written from scratch, which requires significant effort.
]

.pull-right[
  <img src="figures/noun-confused-7442754.png" width=70/> Even afterwards, the resulting data may not be realistic enough to use to guide any practical conclusions.
]

I know this firsthand from writing my PhD thesis [6]... but the situation seems to be improving quite rapidly!!

---

### Semisynthetic Data

One of the major advances has been the design of algorithms that can leverage
public data resources, like [7; 8; 9; 10].

* **Semisynthetic Data**: The output from a simulator that has been designed to mimic external, template data. 
* **Template Data**: Previously gathered experimental data that can be used to train a simulator.

---

### New Packages

We also have many more packages that implement these new methods. Here are 6 out
of the 11 packages discussed in our review [11].

---

### Talk Outline

This talk gives examples of how semisynthetic data can help microbiome data
analysis. It is based on references 
[12; 13; 11].

<hr/>
<div class="outline-container">
  <div class="act-header">
    <span class="tilde"></span> Act I: Benchmarking and Power Analysis <span class="tilde"></span>
  </div>
  <div class="sub-item">Differential Abundance</div>
  <div class="sub-item">Dimensionality Reduction</div>
  
  <div class="interlude">
    <span class="tilde"></span> Interlude on Evaluation <span class="tilde"></span>
  </div>
  
  <div class="act-header">
    <span class="tilde"></span> Act II: Reliability and Mediation <span class="tilde"></span>
  </div>
  <div class="sub-item">Data Integration</div>
  <div class="sub-item">Mediation Analysis</div>
</div>
<hr/>

---

## Act I: Benchmarking and Power Analysis

---

### Differential Abundance

A common question in microbiome analysis is whether a given taxon is more or
less abundant in some conditions than in others. Formally, consider
  * Hypotheses of interest: `\(H_{1}, \dots, H_{M}\)`. Some of them are non-null, but you don't know which.
  * Associated `\(p\)`-values: `\(p_{1}, \dots, p_{M}\)`.

Goal: Reject as many non-null hypotheses as possible while controlling the
_False Discovery Rate_ [14; 15],

`\begin{align*}
\text{FDR} := \mathbf{E}\left[\frac{|\text{False Positives}|}{|\text{Rejections}| \vee 1}\right]
\end{align*}`
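
As a rough illustration of this goal, here is a minimal sketch that applies the Benjamini-Hochberg adjustment [14] to simulated `\(p\)`-values. The number of hypotheses, the effect size, and the threshold `q` are made-up values chosen only for illustration.

```r
set.seed(1)
M <- 1000                                    # number of hypotheses
is_null <- rep(c(TRUE, FALSE), c(900, 100))  # known here only because we simulate

# Null z-statistics are centered at 0; non-null ones are shifted (assumed effect = 3)
z <- rnorm(M, mean = ifelse(is_null, 0, 3))
p <- 2 * pnorm(-abs(z))                      # two-sided p-values

# Benjamini-Hochberg: reject when the adjusted p-value is at most q
q <- 0.1
rejected <- p.adjust(p, method = "BH") <= q

# False discovery proportion and power for this single simulated dataset
fdp <- sum(rejected & is_null) / max(sum(rejected), 1)
power <- sum(rejected & !is_null) / sum(!is_null)
c(fdp = fdp, power = power)
```

The FDR is the expectation of `fdp` over repeated datasets, so a single run can land above or below `q`.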

---

### Simulation Example

We can define a benchmark using our data. In this example, we

1. Train the simulator to mimic 130 genera from the study 
[16], allowing means and variances to depend on BMI
group.

1. Define _computational negative controls_ by removing effects from the
genera with the weakest effects (insignificant at a cutoff `\(q = 0.1\)`).
.center[
<img src="figures/bmi_effect_cartoon.png" width=910/>
]

---

### Example Simulated Data

The semisynthetic data seems to capture group and taxa differences among these
highly abundant taxa.

---

### Benchmarking Analysis

.pull-three-quarters-right[
All methods control the FDR. LIMMA has high variability in performance. Power
only plateaus around `\(n = 1000\)` samples.
]

---

### Implementation

For this simulator, we used a zero-inflated Negative Binomial variant of the
scDesign3 model [17]. For the
abundance of taxon `\(j\)` in sample `\(i\)`, we used:
`\begin{align*}
X_{ij} \sim \text{ZINB}\left(\mu_{g\left(i\right)j}, \varphi_{g\left(i\right)j}, \nu_{j}\right)
\end{align*}`
where `\(g\left(i\right)\)` is the BMI category of sample `\(i\)` and where 
`\(\mu, \varphi\)`, and `\(\nu\)` are mean, dispersion, and zero-inflation parameters,
respectively.
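
To make the model concrete, here is a minimal base R sketch of sampling from a ZINB distribution with group-specific parameters. The helper `rzinb` and all parameter values are hypothetical, and the sketch assumes that `\(\varphi\)` plays the role of the `size` argument of `rnbinom()`; the parameterization inside scDesign3 may differ.

```r
# Hypothetical helper: draw zero-inflated negative binomial counts.
# Assumes Var = mu + mu^2 / phi before zero inflation (rnbinom's `size`).
rzinb <- function(n, mu, phi, nu) {
  counts <- rnbinom(n, mu = mu, size = phi)
  structural_zero <- rbinom(n, size = 1, prob = nu) == 1
  counts[structural_zero] <- 0
  counts
}

set.seed(1)
# Made-up parameters for one taxon j in two BMI groups
mu_j <- c(lean = 20, obese = 60)
phi_j <- c(lean = 2, obese = 2)
nu_j <- 0.3

g <- rep(c("lean", "obese"), each = 500)   # g(i): group of sample i
x_j <- rzinb(length(g), mu = mu_j[g], phi = phi_j[g], nu = nu_j)

# Group means should be near (1 - nu_j) * mu_j
group_means <- tapply(x_j, g, mean)
group_means
```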

---

### Community-wide Associations

In many problems, we are interested in the relationships across a collection of
taxa. These analyses require more advanced methods, like network
[18; 19] or dimensionality reduction [20] techniques (see figure below).

.center[
<img src="figures/antibiotic_prototypes.png" width=840/><br/>
<span style="font-size: 18px;">
</span>
]

---

### Motivation: Power Analysis

.pull-left[
1. Power analyses are intended to prevent researchers from embarking on studies
that have very little chance of detecting the hypothesized signals.

1. While there are formulas for certain univariate tests, there aren't any for 
more complex, multivariate models.
]

.pull-right[
<img src="figures/lenth_power_calculator.png" width=500/>
<span style="font-size: 18px;">
Calculator from Russ Lenth's power and sample size webpage [21].
</span>
]

---

### SPLS-DA Setting

Our power analysis uses Sparse Partial Least Squares Discriminant Analysis
(SPLS-DA) [22].
This topic could be its own full workshop, but let's review the core ideas.

* S: Not all features are predictive
* PLS: Many features are correlated with one another
* DA: The response is one of `\(K\)` classes

---

### Example Output

In this example, we are comparing mice with and without a mouse model of Type I
diabetes (T1D). SPLS-DA helps us find taxa that distinguish healthy and disease
groups.

.pull-left[
<img src="figures/t1d-true-data.png"/>
]
.pull-right[
<img src="figures/t1d-true-data-factors.png"/>
]

---

### Problem Formulation

How many samples are necessary before this method can recover the
discriminating factors?

* **Estimate**: Train a simulator on the original data.
* **Alter/Sample**: Define negative control taxa with no association with T1D.
* **Gather/Summarize**: Evaluate SPLS-DA performance on semisynthetic data with
varying sample sizes and fractions of negative control taxa.
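
The three steps can be sketched in base R. This is only a stand-in under stated assumptions: the template data are made up, the simulator is a simple per-taxon Gaussian model rather than the one used in the talk, and per-taxon t-tests with a Benjamini-Hochberg correction replace SPLS-DA. The names `simulate_data` and `power_at` are hypothetical.

```r
set.seed(1)
# Stand-in template data: 40 samples x 30 taxa on a log scale, two groups
n0 <- 40; D <- 30
group0 <- rep(c(0, 1), each = n0 / 2)
template <- matrix(rnorm(n0 * D), n0, D) +
  outer(group0, c(rep(1, 10), rep(0, 20)))

# Estimate: per-taxon group means and residual standard deviations
mu0 <- colMeans(template[group0 == 0, ])
mu1 <- colMeans(template[group0 == 1, ])
sigma <- apply(template - rbind(mu0, mu1)[group0 + 1, ], 2, sd)

# Alter/Sample: equalize group means for negative control taxa, then draw n samples
simulate_data <- function(n, null_taxa) {
  m0 <- mu0; m1 <- mu1
  pooled <- (mu0[null_taxa] + mu1[null_taxa]) / 2
  m0[null_taxa] <- pooled
  m1[null_taxa] <- pooled
  group <- rep(c(0, 1), each = n / 2)
  noise <- matrix(rnorm(n * D, sd = rep(sigma, each = n)), n, D)
  list(x = rbind(m0, m1)[group + 1, ] + noise, group = group)
}

# Gather/Summarize: average fraction of non-null taxa detected
power_at <- function(n, null_taxa, reps = 50) {
  mean(replicate(reps, {
    sim <- simulate_data(n, null_taxa)
    p <- apply(sim$x, 2, function(v) {
      t.test(v[sim$group == 0], v[sim$group == 1])$p.value
    })
    detected <- p.adjust(p, method = "BH") <= 0.1
    mean(detected[-null_taxa])
  }))
}

null_taxa <- 11:30
power <- sapply(c(20, 80, 320), power_at, null_taxa = null_taxa)
power
```

Swapping in a different simulator or analysis method only changes `simulate_data` and the body of `power_at`; the surrounding loop over sample sizes stays the same.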

---

### Bivariate Relationships

Here are example bivariate relationships learned by the simulator. Do you see
anything off?

---

### Power Analysis

These are the results of our simulation experiment across varying sample sizes
and proportions of truly associated taxa. When few taxa are truly predictive,
many more samples are needed.

---

### Copula Models

Copulas are a type of model that "couple" a collection of known marginal
distributions into a joint distribution.

---

### Starting Point

If we were asked to simulate a vector of five correlated variables on
our computers right now, what would be the easiest thing to do?


``` r
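library(mvtnorm)
D <- 5  # number of variables
ones <- rep(1, D)
# Covariance with unit variances and pairwise correlation 0.99
Sigma <- 0.01 * diag(D) + 0.99 * ones %*% t(ones)
rmvnorm(3, rep(0, D), Sigma)  # three draws from N(0, Sigma)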
```

```
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.9800037 -0.9979774 -0.9419255 -0.8202038 -0.8749182
## [2,]  1.4060867  1.4352558  1.4819112  1.4008727  1.3636058
## [3,]  0.7326696  0.6326349  0.6412391  0.6804125  0.6181781
```

The difficulty is that we usually want non-Gaussian margins `\(F_{1}, \dots, F_{D}\)`.

---

### Intuition

* In the Gaussianized space, it's easy to model correlation.
* The mapping back and forth is possible because we know the margins `\(F\)`.
  - `\(\Phi\)` represents the Gaussian CDF applied componentwise
<br/>
<br/>

---

### Variations

1. We might expect the correlation structure to vary across groups. This can be
accomplished by setting separate `\(\Sigma_{k}\)` across groups `\(k\)`.

1. In high dimensions, the sample covariance `\(\hat{\Sigma}\)` can become
unstable. In this case, we should use high-dimensional covariance estimators
[23; 24].

---

## Interlude: Evaluating Simulators

---

### Evaluation Taxonomy

To be useful, simulated data need to be realistic. A few differences to be aware of:

* **Narrow/Broad Measures**: Narrow measures focus on small subsets of taxa, while broad measures evaluate community-level properties.

* **Graphical/Quantitative**: Some checks are visual, while others are more easily quantifiable.

* **Fit-for-purpose measures**: Evaluation can focus on specific parameter estimates or analysis results.

Different types of realism should have higher priority depending on the
downstream tasks.

---

### Evaluation through Classification

What type of model would you use to simulate data like this?

---

* A natural enough starting point is a Gaussian mixture model with `\(K = 4\)`.
* We can simulate from the fit, but it seems quite far off.
.pull-left[
_Simulated_
<img src="figures/Gaussian (Shared Covarince).png" width="480"/>
]
.pull-right[
_Truth_
<img src="figures/true_mixture.png" width="480"/>
]

---

We make our assessment quantitative using the discriminator idea of [25].

The prediction probabilities below come from a gradient boosting model (GBM). Its
out-of-sample accuracy is 65.5%.
.pull-left[
_Simulated_
<img src="figures/Gaussian (Shared Covarince)-prob.png" width="480"/>
]
.pull-right[
_Truth_
<img src="figures/true-Gaussian (Shared Covariance)-prob.png" width="480"/>
]

---

As a next step, we increase the number of components to `\(K = 5\)` and fit different variances per component.

We still over-sample the gap between the two bottom-left clusters, but the GBM
accuracy has dropped to 55.5%.
.pull-left[
_Simulated_<br/>
<img src="figures/Gaussian (Individual Covariance)-prob.png" width="440"/>
]
.pull-right[
_Truth_<br/>
<img src="figures/true-Gaussian (Individual Covariance)-prob.png" width="440"/>
]

---

* We use a mixture of `\(t\)` distributions next.
* GBM accuracy is now 50.6%.
  - Unsurprisingly, this is the true mechanism that generated the data.

.pull-left[
_Simulated_

<img src="figures/Student's t (Individual Covariance)-prob.png" width="500"/>
]
.pull-right[
_Truth_

<img src="figures/true-Student's t (Individual Covariance)-prob.png" width="500"/>
]

---

The discrimination probabilities become closer to 0.5 the more accurate the simulation becomes.
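
The discriminator idea can be sketched in a few lines of base R. To stay dependency-free, this sketch swaps the gradient boosting model for a logistic regression with quadratic terms, and the "real" and simulated datasets are made-up bivariate Gaussians. The function name `discriminator_accuracy` is hypothetical.

```r
set.seed(1)
n <- 2000
rho <- 0.8
R <- chol(matrix(c(1, rho, rho, 1), 2))
real <- matrix(rnorm(2 * n), n, 2) %*% R       # correlated "real" data
sim_poor <- matrix(rnorm(2 * n), n, 2)         # simulator that ignores correlation
sim_good <- matrix(rnorm(2 * n), n, 2) %*% R   # simulator that matches the truth

# Train a classifier to separate real from simulated samples and
# report its accuracy on a held-out half of the pooled data
discriminator_accuracy <- function(real, simulated) {
  d <- data.frame(rbind(real, simulated),
                  is_real = rep(c(1, 0), c(nrow(real), nrow(simulated))))
  test <- sample(nrow(d)) <= nrow(d) / 2
  fit <- glm(is_real ~ X1 * X2 + I(X1^2) + I(X2^2),
             data = d[!test, ], family = binomial)
  pred <- predict(fit, d[test, ], type = "response") > 0.5
  mean(pred == (d$is_real[test] == 1))
}

acc_poor <- discriminator_accuracy(real, sim_poor)
acc_good <- discriminator_accuracy(real, sim_good)
c(poor = acc_poor, good = acc_good)
```

An accuracy near 0.5 means the classifier cannot tell real from simulated samples. It only shows that this particular classifier found no difference, so a more flexible discriminator is a stricter check.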

---

## Act II: Reliability and Mediation

---

### Reliability Checks

1. Beyond power and benchmarking analysis, simulations can clarify how to
interpret a complicated workflow.

1. Following the lead of 
[26; 27], we have been
calling this a *reliability check*.  These checks construct hypothetical
scenarios to understand how methods behave.

<div style="margin-left: 100px;">
<span style="font-family: 'Exo 2'; font-size: 18;">
The analysis should not...<br/>
&nbsp;&nbsp;&nbsp;&nbsp;introduce spurious signals.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;give high confidence results on uncertain data.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;yield very different answers on similar datasets.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;drown out subtle effects.<br/>
&nbsp;&nbsp;&nbsp;&nbsp;etc...
</span>
</div>

---

### Vertical Data Integration

To illustrate, let's consider a vertical data integration question 
[28]. These are problems where we get complementary
'omics views of the same samples.

The goal is to prepare a unified analysis which considers relationships across
sources.

---

### ICU Example

.pull-left[
The study [29] used amplicon sequencing data to profile
the bacterial, viral, and fungal composition in the gut microbiome samples from
ICU patients at a hospital, including a subset who were experiencing sepsis.
]

---

### Multiblock SPLS-DA Analysis

Multiblock SPLS-DA generalizes SPLS-DA to incorporate measurements across
multiple tables [30]. With 
`\(\texttt{sepsis} \times \texttt{antibiotics}\)` status as the response
variable, the method outputs the plots below.

---

### Reliability Check

It's not obvious how we should interpret this output. For example, the virus
data must influence the bacteria plot, because the method integrated across
sources, but how strong is the influence?

Some integration methods are more aggressive than others.

---

### Semisynthetic Data

To calibrate our interpretation, we first fit a simulator using all data. We
then deliberately remove all associations between the bacteria community
profiles and sepsis status.

---

### Simulation Results

Applying Multiblock SPLS-DA to these data suggests that we are in an "aggressive
integration" regime. 
.center[<img src="figures/multiblock_calibration.png" width=780/>]
A reliability check like this might have helped the authors of [2]
realize that their normalization procedure introduced spurious associations.

---

### Model Comparison

It's common to compare models using `\(R^{2}\)` or prediction performance. Less well
known is that we can _also_ use semisynthetic data. This works even when
regression language is insufficient.

---

### Mindfulness Interventions

This type of model comparison was helpful in an ongoing collaboration with Jo
Handelsman (Plant Pathology) and Richie Davidson (Psychology and
Psychiatry). The driving question is: 
<br/>
<br/>
<div style="font-size: 32px; font-family: 'Nunito'; margin-left: 100px;">
Is it possible to improve psychiatric treatment for a patient using knowledge of their microbiome?
</div>
<br/>
<br/>
Indeed, there is growing evidence for a relationship between the microbiome
and psychiatric conditions, both in mouse models and in observational human
studies [31; 32; 33; 34].

---

### Aside: Event in Two Weeks

---

### Pilot Study

.pull-left[
1. We re-analyzed a pilot study from 2021 [35], which
gathered data from 54 participants randomly assigned to either a mindfulness
training intervention or a waitlist control (n = 27 each).

1. The training lasted 2 months. Data were collected at the start, finish, and 2
month follow-up.
]

---

### Mediation Analysis

1. We were concerned that the mindfulness intervention might affect behavior,
which in turn influences microbiota composition.
2. To explore this, we applied a form of mediation analysis to the 16S
microbiome and survey data.

---

### Mediation Analysis

---

### Estimated Indirect Effects

These figures summarize the paths `\(T \to M \to Y\)`.<br/>
(i.e., color `\(\to\)` x-axis `\(\to\)` y-axis).

---

### Synthetic Null Data

We can alter the simulator so that some pathways are "turned off." Estimates
derived from these data provide a reference null distribution.

.center[
<img src="figures/mindfulness-altered.png" width=780/><br/>
<span style="font-size: 24px;">
The middle panel comes from a synthetic null: `\(T \nrightarrow M \to Y\)`.
</span>
]

---

### Synthetic Null Hypothesis Testing

We can rank the effects learned from both the real and synthetic null reference
data. The significance threshold is chosen to control the proportion of null
estimates (false positives) that are included among the discoveries.
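
One way to turn this ranking into a threshold is sketched below. The effect estimates are made up, and the rule shown (the ratio of null to real estimates exceeding a cutoff) is an assumption for illustration; the procedure implemented in `multimedia` may differ in its details.

```r
set.seed(1)
# Made-up effect estimates for 200 pathways, 20 of which carry true signal
real_est <- c(rnorm(20, mean = 4), rnorm(180))
# The same estimator applied to data from the synthetic null simulator
null_est <- rnorm(200)

q <- 0.1
thresholds <- sort(abs(real_est), decreasing = TRUE)

# Estimated FDR at each cutoff: null exceedances / real exceedances
fdr_hat <- sapply(thresholds, function(t) {
  sum(abs(null_est) >= t) / max(sum(abs(real_est) >= t), 1)
})

# Most permissive cutoff whose estimated FDR is still at most q
ok <- which(fdr_hat <= q)
threshold <- if (length(ok) > 0) thresholds[max(ok)] else Inf
discoveries <- which(abs(real_est) >= threshold)
length(discoveries)
```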

---

## Conclusion

---

### Software and Resources

All the examples I discussed today can be run from online tutorials we've
written to accompany our papers:

* Simulation for Microbiome Analysis ([go.wisc.edu/wnj5p9](https://go.wisc.edu/wnj5p9))
* Generative Models Examples ([go.wisc.edu/ax73qb](https://go.wisc.edu/ax73qb))

The relevant R packages behind these analyses are:

* `multimedia` - Mediation analysis for microbiome data.
* `scDesign3` - An existing simulator for single cell data.
* `scDesigner` - Under-development version used in the first tutorial.

---

Simulation can turn a problem of logic into one of observation.

P.S. We need your help! We are looking for more examples to include as
simulation workflows in `scDesigner`. If you have data or problems that could
benefit from simulation, please reach out.

---

* Contact: ksankaran@wisc.edu
* Lab Members: Margaret Thairu, Shuchen Yan, Yuliang Peng, Helena Huang
* Funding: NIGMS R01GM152744, NIAID R01AI184095
* Co-authors: Hanying Jiang, Xinran Miao, Mara Beebe, Dan W. Grupe, Richie
Davidson, Jo Handelsman, Saritha Kodikara, Jingyi Jessica Li, Kim-Anh Lê Cao,
Susan Holmes

---

### References

[1] K. A. Earle et al. "Quantitative Imaging of Gut Microbiota Spatial Organization". In: _Cell Host &amp; Microbe_ 18.4 (Oct. 2015), p. 478–488. ISSN: 1931-3128. DOI:
[10.1016/j.chom.2015.09.002](https://doi.org/10.1016%2Fj.chom.2015.09.002). URL:
[http://dx.doi.org/10.1016/j.chom.2015.09.002](http://dx.doi.org/10.1016/j.chom.2015.09.002).

[2] G. D. Poore et al. "RETRACTED ARTICLE: Microbiome analyses of blood and tissues suggest cancer diagnostic approach". In: _Nature_ 579.7800 (Mar. 2020), p. 567–574.
ISSN: 1476-4687. DOI: [10.1038/s41586-020-2095-1](https://doi.org/10.1038%2Fs41586-020-2095-1). URL:
[http://dx.doi.org/10.1038/s41586-020-2095-1](http://dx.doi.org/10.1038/s41586-020-2095-1).

[3] A. Gihawi et al. "Major data analysis errors invalidate cancer microbiome findings". In: _mBio_ 14.5 (Oct. 2023). Ed. by I. B. Zhulin. ISSN: 2150-7511. DOI:
[10.1128/mbio.01607-23](https://doi.org/10.1128%2Fmbio.01607-23). URL: [http://dx.doi.org/10.1128/mbio.01607-23](http://dx.doi.org/10.1128/mbio.01607-23).

[4] G. D. Sepich-Poore et al. "Reply to: Caution Regarding the Specificities of Pan-Cancer Microbial Structure". (Feb. 2023). DOI:
[10.1101/2023.02.10.528049](https://doi.org/10.1101%2F2023.02.10.528049). URL: [http://dx.doi.org/10.1101/2023.02.10.528049](http://dx.doi.org/10.1101/2023.02.10.528049).

[5] G. Tonkin-Hill. _GitHub - gtonkinhill/TCGA\_analysis - github.com_. <https://github.com/gtonkinhill/TCGA_analysis>. [Accessed 21-06-2024]. 2023.

[6] K. Sankaran. "Discovery and visualization of latent structure with applications to the microbiome". PhD thesis. Stanford University, 2018. URL:
[https://purl.stanford.edu/nx110xz3452](https://purl.stanford.edu/nx110xz3452).

[7] E. Pasolli et al. "Accessible, curated metagenomic data through ExperimentHub". In: _Nature Methods_ 14 (2017), pp. 1023-1024. URL:
[https://api.semanticscholar.org/CorpusID:3403081](https://api.semanticscholar.org/CorpusID:3403081).

[8] E. Muller et al. "The gut microbiome-metabolome dataset collection: a curated resource for integrative meta-analysis". In: _npj Biofilms and Microbiomes_ 8.1 (Oct.
2022). ISSN: 2055-5008. DOI: [10.1038/s41522-022-00345-5](https://doi.org/10.1038%2Fs41522-022-00345-5). URL:
[http://dx.doi.org/10.1038/s41522-022-00345-5](http://dx.doi.org/10.1038/s41522-022-00345-5).

[9] F. G. M. Ernst et al. _microbiomeDataSets_. 2021. DOI:
[10.18129/B9.BIOC.MICROBIOMEDATASETS](https://doi.org/10.18129%2FB9.BIOC.MICROBIOMEDATASETS). URL:
[https://bioconductor.org/packages/microbiomeDataSets](https://bioconductor.org/packages/microbiomeDataSets).

[10] _Home - National Microbiome Data Collaborative - microbiomedata.org_. <https://microbiomedata.org/>. [Accessed 17-02-2025].

[11] K. Sankaran et al. "Semisynthetic simulation for microbiome data analysis". En. In: _Brief. Bioinform._ 26.1 (Nov. 2024).

[12] K. Sankaran et al. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), p. 325–352. ISSN:
2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134). URL:
[http://dx.doi.org/10.1146/annurev-statistics-033121-110134](http://dx.doi.org/10.1146/annurev-statistics-033121-110134).

[13] H. Jiang et al. "Multimedia: multimodal mediation analysis of microbiome data". In: _Microbiology Spectrum_ 13.2 (Feb. 2025). Ed. by J. Claesen. ISSN: 2165-0497.
DOI: [10.1128/spectrum.01131-24](https://doi.org/10.1128%2Fspectrum.01131-24). URL:
[http://dx.doi.org/10.1128/spectrum.01131-24](http://dx.doi.org/10.1128/spectrum.01131-24).

---

### References

[14] Y. Benjamini et al. "Controlling the false discovery rate: a practical and powerful approach to multiple testing". In: _Journal of the Royal statistical society:
series B (Methodological)_ 57.1 (1995), pp. 289-300.

[15] B. Efron. _Large-scale inference: empirical Bayes methods for estimation, testing, and prediction_. Vol. 1. Cambridge University Press, 2012.

[16] L. Lahti et al. "Tipping elements in the human intestinal ecosystem". In: _Nature Communications_ 5.1 (Jul. 2014). ISSN: 2041-1723. DOI:
[10.1038/ncomms5344](https://doi.org/10.1038%2Fncomms5344). URL: [http://dx.doi.org/10.1038/ncomms5344](http://dx.doi.org/10.1038/ncomms5344).

[17] D. Song et al. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". En. In: _Nature Biotechnology_ 42.2 (May. 2023), p.
247–252. DOI: [10.1038/s41587-023-01772-1](https://doi.org/10.1038%2Fs41587-023-01772-1). URL:
[http://dx.doi.org/10.1038/s41587-023-01772-1](http://dx.doi.org/10.1038/s41587-023-01772-1).

[18] Y. Shen et al. "Estimating sparse direct effects in multivariate regression with the spike-and-slab LASSO". In: _Bayesian Anal._ (Jan. 2024).

[19] Y. Shen et al. "The effect of the prior and the experimental design on the inference of the precision matrix in Gaussian chain graph models". En. In: _J. Agric.
Biol. Environ. Stat._ (May. 2024).

[20] K. Sankaran et al. "Latent variable modeling for the microbiome". In: _Biostatistics_ 20.4 (Jun. 2018), p. 599–614. ISSN: 1468-4357. DOI:
[10.1093/biostatistics/kxy018](https://doi.org/10.1093%2Fbiostatistics%2Fkxy018). URL:
[http://dx.doi.org/10.1093/biostatistics/kxy018](http://dx.doi.org/10.1093/biostatistics/kxy018).

[21] R. Lenth. _Java applets for power and sample size - homepage.divms.uiowa.edu_. <https://homepage.divms.uiowa.edu/~rlenth/Power/index.html>. [Accessed 17-02-2025].

[22] F. Rohart et al. "mixOmics: An R package for 'omics feature selection and multiple data integration". En. In: _PLoS Comput. Biol._ 13.11 (Nov. 2017), p. e1005752.

[23] J. Friedman et al. "Sparse inverse covariance estimation with the graphical lasso". En. In: _Biostatistics_ 9.3 (Jul. 2008), pp. 432-441.

[24] T. Cai et al. "Adaptive Thresholding for Sparse Covariance Matrix Estimation". In: _J. Am. Stat. Assoc._ 106.494 (Jun. 2011), pp. 672-684.

[25] J. Friedman. _On multivariate goodness-of-fit and two-sample testing_. Tech. rep. Citeseer, 2004.

[26] D. Song et al. "PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data".
In: _Genome biology_ 22.1 (2021), p. 124.

[27] D. Song. "Improving Statistical Rigor in Single-Cell and Spatial Omics". PhD thesis. University of California, Los Angeles, 2024.

[28] K. Lê Cao et al. "Community-wide hackathons to identify central themes in single-cell multi-omics". In: _Genome biology_ 22 (2021), pp. 1-21.

---

### References

[29] B. W. Haak et al. "Integrative transkingdom analysis of the gut microbiome in antibiotic perturbation and critical illness". En. In: _mSystems_ 6.2 (Mar. 2021).

[30] A. Singh et al. "DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays". In: _Bioinformatics_ 35.17 (Jan. 2019). Ed. by I.
Birol, p. 3055–3062. ISSN: 1367-4811. DOI: [10.1093/bioinformatics/bty1054](https://doi.org/10.1093%2Fbioinformatics%2Fbty1054). URL:
[http://dx.doi.org/10.1093/bioinformatics/bty1054](http://dx.doi.org/10.1093/bioinformatics/bty1054).

[31] L. H. Morais et al. "The gut microbiota-brain axis in behaviour and brain disorders". In: _Nature Reviews Microbiology_ 19.4 (2021), pp. 241-255.

[32] J. A. Bosch et al. "The gut microbiota and depressive symptoms across ethnic groups". En. In: _Nature Communications_ 13.1 (Dec. 2022), p. 7129. ISSN: 2041-1723.
DOI: [10.1038/s41467-022-34504-1](https://doi.org/10.1038%2Fs41467-022-34504-1). URL:
[https://www.nature.com/articles/s41467-022-34504-1](https://www.nature.com/articles/s41467-022-34504-1) (visited on 12/11/2022).

[33] J. A. Foster et al. "Gut-brain axis: how the microbiome influences anxiety and depression". In: _Trends in Neurosciences_ 36.5 (May. 2013), pp. 305-312.

[34] P. Zheng et al. "The gut microbiome modulates gut-brain axis glycerophospholipid metabolism in a region-specific manner in a nonhuman primate model of depression".
In: _Molecular psychiatry_ 26.6 (2021), pp. 2380-2392.

[35] D. W. Grupe et al. "The Impact of Mindfulness Training on Police Officer Stress, Mental Health, and Salivary Cortisol Levels". In: _Frontiers in Psychology_ 12 (Sep.
2021). ISSN: 1664-1078. DOI: [10.3389/fpsyg.2021.720753](https://doi.org/10.3389%2Ffpsyg.2021.720753). URL:
[http://dx.doi.org/10.3389/fpsyg.2021.720753](http://dx.doi.org/10.3389/fpsyg.2021.720753).

---

### Evaluation Taxonomy

This is how some common techniques fall into this taxonomy.

* **Graphical, Narrow**: Boxplots or cumulative distribution function plots comparing real vs. simulated taxa (like the DA example).

* **Graphical, Broad**: Principal component plots of real vs. simulated dataset.

* **Quantitative, Narrow**: Two-sample Kolmogorov-Smirnov test.

* **Quantitative, Broad**: Evaluation through classification (next example).

* **Fit-for-Purpose**: Linear model coefficients on real vs. simulated data ([11], "Batch Effect Correction").
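
As a small example of a quantitative, narrow check, this sketch compares one taxon's real and simulated counts with a two-sample Kolmogorov-Smirnov statistic. All data are made up, and `ks_stat` is a hypothetical helper.

```r
set.seed(1)
real_taxon <- rnbinom(200, mu = 30, size = 2)   # stand-in for observed counts
sim_close <- rnbinom(200, mu = 30, size = 2)    # simulator with a matching margin
sim_off <- rpois(200, lambda = 30)              # simulator that ignores overdispersion

# Counts contain ties, so ks.test() warns that p-values are approximate;
# here we only compare the test statistics.
ks_stat <- function(a, b) {
  unname(suppressWarnings(ks.test(log1p(a), log1p(b)))$statistic)
}

stat_close <- ks_stat(real_taxon, sim_close)
stat_off <- ks_stat(real_taxon, sim_off)
c(close = stat_close, off = stat_off)
```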

---

### SPLS-DA Intuition

We "blend" columns of `\(\mathbf{X}\)` and `\(\mathbf{Y}\)` within tables until the patterns look similar.

Roughly, choose weights `\(\mathbf{a}\)` and `\(\mathbf{b}\)` to maximize
`\(\text{cor}\left(\mathbf{Xa}, \mathbf{Yb}\right)\)`.
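
A minimal sketch of the first pair of weights, without the sparsity penalty, is below. It uses the standard PLS criterion, which maximizes the covariance (rather than the correlation) of `\(\mathbf{Xa}\)` and `\(\mathbf{Yb}\)` over unit-norm weights. The data are made up. In the discriminant analysis setting, `\(\mathbf{Y}\)` would be a matrix of class indicators.

```r
set.seed(1)
n <- 100
latent <- rnorm(n)   # pattern shared by both tables
X <- cbind(latent + rnorm(n, sd = 0.5), latent + rnorm(n, sd = 0.5), rnorm(n))
Y <- cbind(latent + rnorm(n, sd = 0.5), rnorm(n))

# Center and scale, then take the leading singular vectors of X^T Y
Xc <- scale(X)
Yc <- scale(Y)
s <- svd(crossprod(Xc, Yc))
a <- s$u[, 1]   # weights for the columns of X
b <- s$v[, 1]   # weights for the columns of Y

# Correlation between the two blended scores
score_cor <- drop(cor(Xc %*% a, Yc %*% b))
score_cor
```

The weights concentrate on the columns that carry the shared pattern, which is why the blended scores end up strongly correlated.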

---

### SPLS-DA Intuition

Now we can compare samples from the two tables in a single, shared space.

---

### SPLS-DA Intuition

To get more than one dimension, we can repeat this process after removing any
correlation with previously found patterns.

---

### Copula Models

More formally, let `\(F_{1}, \dots, F_{D}\)` be the target margins and let `\(\Phi\)` be
the CDF of the Gaussian distribution. Gaussian Copula modeling has these steps.

Estimate:

1. Gaussianize the observed `\(\mathbf{x}_{i}\)` to `\(\mathbf{z}_{i} := \left[\Phi^{-1}\left(F_{1}\left(x_{i1}\right)\right), \dots, \Phi^{-1}\left(F_{D}\left(x_{iD}\right)\right)\right]\)`
1. Estimate the covariance `\(\hat{\Sigma}\)` associated with the `\(\mathbf{z}_{i}\)`

Simulate:

1. Draw `\(\mathbf{z}^\ast \sim \mathcal{N}\left(0, \hat{\Sigma}\right)\)`
1. Transform back `\(\mathbf{x}^{\ast} := \left[F_{1}^{-1}\left(\Phi\left(z_{1}^\ast\right)\right), \dots, F_{D}^{-1}\left(\Phi\left(z_{D}^\ast\right)\right)\right]\)`
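
These steps can be sketched in base R. The template data are made up, and the margins are assumed to be negative binomial with moment-matched parameters. Because counts are discrete, the Gaussianization step draws uniformly between `\(F_{j}\left(x - 1\right)\)` and `\(F_{j}\left(x\right)\)`, one common workaround. The sketch also uses the correlation form of `\(\hat{\Sigma}\)` so that the Gaussian margins have unit variance.

```r
set.seed(1)
# Stand-in template data (made up): two correlated count variables
n <- 500
shared <- rnorm(n)
x <- cbind(rpois(n, exp(1 + 0.5 * shared)), rpois(n, exp(2 + 0.5 * shared)))
D <- ncol(x)

# Assumed margins F_j: negative binomial, fit by matching moments
margins <- lapply(seq_len(D), function(j) {
  mu <- mean(x[, j])
  list(mu = mu, size = mu^2 / max(var(x[, j]) - mu, 1e-8))
})

# Estimate, step 1: Gaussianize each column
gaussianize <- function(v, m) {
  u <- runif(length(v),
             pnbinom(v - 1, mu = m$mu, size = m$size),
             pnbinom(v, mu = m$mu, size = m$size))
  qnorm(pmin(pmax(u, 1e-10), 1 - 1e-10))   # clamp away from 0 and 1
}
z <- sapply(seq_len(D), function(j) gaussianize(x[, j], margins[[j]]))

# Estimate, step 2: correlation version of Sigma-hat
Sigma_hat <- cor(z)

# Simulate, step 1: draw z* ~ N(0, Sigma_hat)
z_star <- matrix(rnorm(n * D), n, D) %*% chol(Sigma_hat)

# Simulate, step 2: map back through the fitted margins
x_star <- sapply(seq_len(D), function(j) {
  qnbinom(pnorm(z_star[, j]), mu = margins[[j]]$mu, size = margins[[j]]$size)
})

round(c(template = cor(x)[1, 2], simulated = cor(x_star)[1, 2]), 2)
```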

---

### Acknowledgments

* frustration by Rikas Dzihab from <a href="https://thenounproject.com/browse/icons/term/frustration/" target="_blank" title="frustration Icons">Noun Project</a> (CC BY 3.0)
* confused by Rikas Dzihab from <a href="https://thenounproject.com/browse/icons/term/confused/" target="_blank" title="confused Icons">Noun Project</a> (CC BY 3.0)