Semisynthetic Simulation for Biological Data Analysis

<div id="title">
Semisynthetic Simulation for Biological Data Analysis
</div>
<div id="under_title">
Session 1: Marginal Modeling
</div>

<div id="subtitle">
Kris Sankaran <br/>
11 | June | 2024 <br/>
Lab: <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/>
</div>

<div id="subtitle_right">
Melbourne Integrative Genomics<br/>
Slides: <a href="https://go.wisc.edu/gfj36r">go.wisc.edu/gfj36r</a><br/>
Code: <a href="https://go.wisc.edu/o5sn6w">go.wisc.edu/o5sn6w</a>
</div>

---

### Overall Learning Outcomes

By the end of this course, you will be able to...

1. Describe simulators: What are their building blocks and core properties?

1. Apply simulators: Know how to use simulation at different stages of the
multi-omics data analysis to guide workflow design and interpretation.

1. Critique simulators: Compare and contrast real with simulated data using
effective data visualizations.

---

### Today: Marginal Simulation

<span style="color:#8C1F33">Marginal</span> `\(\to\)` Multivariate `\(\to\)` Integrative

---

### Today's Learning Outcomes

By the end of this session, you will be able to...

1. Manipulate and interpret `SummarizedExperiment` objects.
1. Design a power analysis for differential testing and accurately communicate
its results.
1. Identify areas of your own research where simulation could help better
allocate limited resources.

---

### Course Expectations

* Bugs are normal! Resolving them is a skill you will develop.
* Please ask a tutor to help at any point. Don't worry about interrupting.
* We will have breaks and discussions. Get to know the others in the course!

---

### The Microbiome

Imagine a collaboration with researchers who study the human gut microbiome --
the ecosystem of microorganisms that live in the gut [1]. Like ordinary
ecology, they want to know:

.pull-left[
* Who is present?
* What are they doing? Which genes are active?
* How does this depend on the host or environmental context?
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="https://whatislife.stanford.edu/images/spatial.png" alt="The microbiome along the gut lining, from [2]." width="250" />
<p class="caption">The microbiome along the gut lining, from [2].</p>
</div>
]

---

### Hypothetical Proposal

They are preparing a grant proposal about how community composition is related
to nutrition. They want to compare the microbiomes associated with malnutrition
and health.

<span style="position: absolute; bottom: 20px">
We need to make sure the budget is spent wisely while ensuring the study isn't underpowered.
</span>

---

### Problem Setup

This is a differential abundance problem [3; 4]. Many existing methods account
for:

* Potentially extreme sparsity and non-Gaussianity.
* Uneven sequencing depth across samples.
* The large number of tests (and the potential to borrow strength).

This is great news, but it means we can't rely on standard `\(t\)`-test power calculators.

---

### Approach: Simulation

Instead, we simulate. We will simulate experiments and see what our power and
false discovery rate (FDR) look like in those hypothetical datasets.

Abstract power calculation `\(\to\)` Concrete computational experimentation.
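
As a rough sketch of what "concrete computational experimentation" can look like, the base R code below simulates a two-group experiment many times and tracks empirical power and false discovery proportion. Every setting here (the negative binomial model, the number of taxa, the fold change, the Wilcoxon test) is an illustrative assumption, not the workflow used in the lab.

``` r
set.seed(1)

# Hypothetical settings (assumptions, not taken from the course data)
n_taxa <- 200        # number of features tested
n_per_group <- 20    # samples in each group
n_signal <- 20       # taxa with a true shift between groups
fold_change <- 3     # mean ratio for the shifted taxa

simulate_once <- function() {
  truth <- seq_len(n_taxa) <= n_signal
  p_values <- vapply(seq_len(n_taxa), function(j) {
    # Negative binomial counts; shifted taxa have a larger mean in group 2
    mu_2 <- if (truth[j]) 10 * fold_change else 10
    healthy <- rnbinom(n_per_group, mu = 10, size = 1)
    malnourished <- rnbinom(n_per_group, mu = mu_2, size = 1)
    wilcox.test(healthy, malnourished, exact = FALSE)$p.value
  }, numeric(1))

  # Benjamini-Hochberg adjustment across all taxa
  rejected <- p.adjust(p_values, method = "BH") < 0.1
  c(
    power = sum(rejected & truth) / n_signal,
    fdp = sum(rejected & !truth) / max(sum(rejected), 1)
  )
}

# Repeat the hypothetical experiment and average the two metrics
results <- replicate(20, simulate_once())
rowMeans(results)
```

Rerunning with a different `n_per_group` shows how the two metrics respond to sample size.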

---

### Simulation with Templates

* Instead of building a simulator from scratch, we fit a generative
model to existing experimental data ("template data") [5; 6].

* Generative models can _generate_ new, hypothetical samples, not just fit
observed ones [7].
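
A minimal base R sketch of the idea, assuming a single taxon and a negative binomial model fit by the method of moments. The "template" counts here are themselves simulated stand-ins, and the course software uses more flexible fitting than this.

``` r
set.seed(2)

# Stand-in for one taxon's counts in a template dataset (hypothetical values)
template_counts <- rnbinom(100, mu = 8, size = 0.7)

# Method-of-moments fit of a negative binomial: var = mu + mu^2 / size
mu_hat <- mean(template_counts)
var_hat <- var(template_counts)
stopifnot(var_hat > mu_hat)  # the estimate below needs overdispersion
size_hat <- mu_hat^2 / (var_hat - mu_hat)

# Generate a new, hypothetical experiment with a different sample size
new_counts <- rnbinom(250, mu = mu_hat, size = size_hat)

c(template_mean = mu_hat, simulated_mean = mean(new_counts))
```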

---

### Simulation with Templates

Abstract power calculation `\(\to\)` Concrete computational experimentation.

---

### Discussion

Please discuss in groups of 2 - 4:

* In your own research, what is one place where simulation might help? 
* What template dataset would you use to guide the simulation?

We will debrief responses as a group.

---

### Approach

The richer our vocabulary for generative models, the better chance we'll have of
finding a realistic simulation mechanism. Most modern methods have three parts.

1. **A template dataset on which to base the simulator.**
1. Flexible families of probability distributions.
1. Regressions to relate samples to experimental or biological factors.

---

### Approach

The richer our vocabulary for generative models, the better chance we'll have of
finding a realistic simulation mechanism. Most modern methods have three parts.

1. A template dataset on which to base the simulator.
1. **Flexible families of probability distributions.**
1. Regressions to relate samples to experimental or biological factors.

---

### Approach

The richer our vocabulary for generative models, the better chance we'll have of
finding a realistic simulation mechanism. Most modern methods have three parts.

1. A template dataset on which to base the simulator.
1. Flexible families of probability distributions.
1. **Regressions to relate samples to experimental or biological factors.**

---

### Representing Data

.pull-left[
We can tie together sequencing output, experimental design, and biological
annotation using `SummarizedExperiment` objects [8]. This will
let us concisely learn: Design + Biology `\(\to\)` Sequencing.
]

---

### Important Functions

To work with `SummarizedExperiment` objects, we can use:

* `assay`: Returns a matrix whose rows are sequencing features (e.g., genes, taxa, ...) and whose columns are samples.
* `rowData`: Returns a `DataFrame` that annotates each sequencing feature.
* `colData`: Returns a `DataFrame` that annotates each sample.

``` r
library(MIGsim)

data(atlas)
head(colData(atlas))
```

```
## DataFrame with 6 rows and 11 columns
##                age      sex nationality DNA_extraction_method  project diversity bmi_group  subject      time      sample log_depth
##          <integer> <factor>    <factor>              <factor> <factor> <numeric>  <factor> <factor> <numeric> <character> <numeric>
## Sample-1        28   male            US                    NA        1      5.76     obese        1         0    Sample-1   8.93498
## Sample-2        24   female          US                    NA        1      6.06     obese        2         0    Sample-2   9.22503
## Sample-3        52   male            US                    NA        1      5.50     lean         3         0    Sample-3   8.87221
## Sample-5        25   female          US                    NA        1      5.89     lean         5         0    Sample-5   9.39266
## Sample-6        42   male            US                    NA        1      5.53     lean         6         0    Sample-6   8.97639
## Sample-8        27   female          US                    NA        1      5.38     lean         8         0    Sample-8   9.07154
```
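
To see how the three accessors line up, it can help to build a toy object by hand. This sketch assumes the Bioconductor `SummarizedExperiment` package is installed. The taxa, phyla, and group labels are made up for illustration.

``` r
library(SummarizedExperiment)

set.seed(3)
# Hypothetical toy data: 4 taxa (rows) by 6 samples (columns)
counts <- matrix(
  rpois(24, lambda = 5), nrow = 4,
  dimnames = list(paste0("taxon_", 1:4), paste0("sample_", 1:6))
)

toy <- SummarizedExperiment(
  assays = list(counts = counts),
  rowData = DataFrame(
    phylum = c("Firmicutes", "Firmicutes", "Bacteroidetes", "Proteobacteria"),
    row.names = rownames(counts)
  ),
  colData = DataFrame(
    group = rep(c("healthy", "malnourished"), each = 3),
    row.names = colnames(counts)
  )
)

dim(assay(toy))  # features by samples
rowData(toy)     # one row per feature
colData(toy)     # one row per sample

# Subsetting columns keeps the assay and colData in sync
healthy_only <- toy[, toy$group == "healthy"]
```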

---

### Definition

We can use `setup_simulator()` to define a new simulator. This requires:

* The `SummarizedExperiment` template data
* A regression formula relating colData features to the parameters
* The probability model to use (e.g., Gaussian or Poisson)

``` r
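library(scDesigner)
library(gamboostLSS)
# exper_ts: template SummarizedExperiment whose colData includes `group`
# Arguments: template data, regression formula, probability model
sim <- setup_simulator(exper_ts, ~ group, ~ GaussianLSS())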
```
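
The regression formula uses standard R formula syntax. As a sketch of what a formula encodes, base R's `model.matrix()` expands it into one column per coefficient. The `design` data frame below is a hypothetical stand-in for `colData`, and how the simulator uses the formula internally is not shown here.

``` r
# Hypothetical sample annotations, standing in for colData
design <- data.frame(
  group = factor(rep(c("A", "B"), each = 3)),
  time = rep(c(0, 0.5, 1), times = 2)
)

# Each formula expands to a design matrix with one column per coefficient
X_group <- model.matrix(~ group, data = design)
X_both <- model.matrix(~ group + time, data = design)

colnames(X_group)  # "(Intercept)" "groupB"
colnames(X_both)   # "(Intercept)" "groupB" "time"
```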

---

### Alteration

We can modify a simulator using the `mutate` command.

``` r
sim |>
  mutate(1:3, link = ~ group + time)
```

```
## [Marginals]
## Plan:
## # A tibble: 6 × 3
##     feature              family          link
##   <gene_id>             <distn>        <link>
## 1         1 Gaussian [mu,sigma] ~group + time
## 2         2 Gaussian [mu,sigma] ~group + time
## 3         3 Gaussian [mu,sigma] ~group + time
## 4         4 Gaussian [mu,sigma]        ~group
## 5         5 Gaussian [mu,sigma]        ~group
## 6         6 Gaussian [mu,sigma]        ~group
## 1, 2, 3, and 3 other features need fitting.
## Estimates:
## # A tibble: 0 × 0
## 
## [Dependence]
## 0 NULLs with  features
## 
## [Template Data]
## class: SummarizedExperiment 
## dim: 6 500 
## metadata(0):
## assays(1): counts
## rownames(6): 1 2 ... 5 6
## rowData names(0):
## colnames: NULL
## colData names(2): group time
```

A more realistic example would be to switch to a zero-inflated negative binomial
for rare species:

``` r
sim |>
  mutate(any_of(rare_taxa), family = ~ ZINBLSS())
```

---

### Estimation & Sampling

Once a simulator is defined, it can be estimated with `estimate`.

``` r
sim <- sim |>
  estimate(nu = 0.1) # learning rate = 0.1
```

We can simulate new experiments using `sample`.

``` r
sample(sim)
```

```
## class: SummarizedExperiment 
## dim: 6 500 
## metadata(0):
## assays(1): counts_1
## rownames(6): 1 2 ... 5 6
## rowData names(0):
## colnames: NULL
## colData names(2): group time
```

---

### New Data

Alternatively, we can use a new `colData` object. This is useful for comparing
sample sizes and experimental designs.

``` r
new_design <- expand.grid(
  group = c("A", "B"), 
  time = seq(0, 1, 0.1)
)

sample(sim, new_data = new_design)
```

```
## class: SummarizedExperiment 
## dim: 6 22 
## metadata(0):
## assays(1): counts_1
## rownames(6): 1 2 ... 5 6
## rowData names(0):
## colnames: NULL
## colData names(2): group time
```
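
To compare sample sizes, one option is to generate a family of candidate designs up front. `make_design()` is a hypothetical helper written for this sketch, not part of any package. It returns data frames with the same `group` and `time` columns as `new_design` above.

``` r
# Hypothetical helper: a balanced design with n samples per group and time
make_design <- function(n_per_group) {
  grid <- expand.grid(
    replicate = seq_len(n_per_group),
    group = c("A", "B"),
    time = c(0, 1)
  )
  # Keep only the columns the simulator's formula refers to
  grid[, c("group", "time")]
}

designs <- lapply(c(5, 10, 20), make_design)
vapply(designs, nrow, integer(1))  # 20 40 80
```

Each element of `designs` could then take the place of `new_design` when sampling, giving one simulated experiment per candidate sample size.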

---

* Code repository: [go.wisc.edu/o5sn6w](https://go.wisc.edu/o5sn6w)
* Complete Solutions: [go.wisc.edu/v986n5](https://go.wisc.edu/v986n5)
* Live Demo: [go.wisc.edu/0vr50d](https://go.wisc.edu/0vr50d)

---

### Summary

1. Guiding philosophy:

   * In experimental biology, controls can help validate data generation.
   * In computational biology, simulation can help validate statistical claims.

1. Power and FDR estimates from simulated experiments can guide concrete, computational power analysis.

1. `SummarizedExperiment` streamlines interaction with experimental data. `assay()`, `rowData()`, and `colData()` are useful accessors.

1. To define a simulator in `scDesigner`, use `setup_simulator()` with R's formula syntax, like `~ treatment + time`.

---

### Next Time: Multivariate Simulation

Marginal `\(\to\)` <span style="color:#8C1F33">Multivariate</span> `\(\to\)` Integrative

---

### References

[1] J. A. Gilbert, M. J. Blaser, J. G. Caporaso, J. K. Jansson, S. V. Lynch, and R. Knight. "Current understanding of the human microbiome". In: _Nat. Med._ 24.4 (Apr. 2018), pp. 392-400.

[2] K. A. Earle, G. Billings, M. Sigal, J. S. Lichtman, G. C. Hansson, J. E. Elias, M. R. Amieva, K. C. Huang, and J. L. Sonnenburg. "Quantitative imaging of gut microbiota spatial organization". In: _Cell Host Microbe_ 18.4 (Oct. 2015), pp. 478-488.

[3] H. Li and H. Li. "Introduction to special issue on statistics in microbiome and metagenomics". In: _Stat. Biosci._ 13.2 (Jul. 2021), pp. 197-199.

[4] L. Waldron. "Data and statistical methods to analyze the human microbiome". In: _mSystems_ 3.2 (Apr. 2018).

---

[5] T. Sun, D. Song, W. V. Li, and J. J. Li. "scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured". In: _Genome Biol._ 22.1 (May 2021), p. 163.

[6] D. Song, Q. Wang, G. Yan, T. Liu, T. Sun, and J. J. Li. "scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics". In: _Nat. Biotechnol._ 42.2 (Feb. 2024), pp. 247-252.

[7] K. Sankaran and S. P. Holmes. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), pp. 325-352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134).

[8] W. Huber, V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, S. Davis, L. Gatto, T. Girke, et al. "Orchestrating high-throughput genomic analysis with Bioconductor". In: _Nat. Methods_ 12.2 (Feb. 2015), pp. 115-121.

---