<div id="links"> Slides: <a href="https://go.wisc.edu/v4hsh9">https://go.wisc.edu/v4hsh9</a> <br/> Colab: <a href="https://go.wisc.edu/1lq21r">https://go.wisc.edu/1lq21r</a><br/> Quarto: <a href="https://go.wisc.edu/5met6d">https://go.wisc.edu/5met6d</a><br/> Compiled: <a href="https://go.wisc.edu/8i6ql5">https://go.wisc.edu/8i6ql5</a> </div> <div id="title"> Simulation in Omics </div> <br/> <br/> <i>As the program notes say, "Dive into the gene pool to see evolution in action," and enjoy "genetic engineering in the privacy of your own home." Requires: 2.5 megabytes of system memory or 3 megabytes under System 7, and a hard drive.</i> <br/> -- The New York Times, 1992, reviewing the newly released SimLife <br/> <br/> <i>Heard melodies are sweet, but those unheard <br/> Are sweeter; therefore, ye soft pipes, play on... <br/></i> -- Keats <div id="subtitle"> Kris Sankaran <br/> Methods for Biological Data<br/> 07 | February | 2024 <br/> </div> --- ### Learning Outcomes 1. Discuss the role of semisynthetic simulation in guiding sequencing data analysis, including for power analysis and methods benchmarking. 1. Create, modify, and critique simulators using the `scDesign3` and `scDesigner` packages. 1. Explain and apply basic techniques for microbiome differential testing and network inference. 1. Manipulate and explain the structure of `SummarizedExperiment` objects in the Bioconductor software ecosystem. --- ### Schedule 1. Intro Presentation - 15 minutes 1. Simulation Quick Start - 30 minutes 1. Discussion - 10 minutes 1. Advanced Example(s) - 35 minutes --- ### Why Simulate? * **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes. * **Benchmarking**: They allow us to benchmark algorithms even when ground truth labels are unavailable. * **Data Augmentation**: Simulated samples can improve algorithmic performance and integration. * **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate. --- ### Why Simulate? * **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes. * **Benchmarking**: They allow us to benchmark algorithms even when ground truth labels are unavailable. * **Data Augmentation**: Simulated samples can improve algorithmic performance and integration. * **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate. <g style="font-size: 20px; margin: 0; line-height: 30px; display: block;"> Examples: BASICS, compcodeR, deconvR, dropsim, ESCO, FreeHi-C, FreeHiCLite, hierarchicell, kersplat, metaART, MOSim, MSstatsSampleSize, MIDAS, Mimesys, multiomics_networks_simulation, muscat, powsimR, POWSC, SCDD, scDesign3, SCRIP, Sim3C, SimATAC, SimFFPE, sismonr, spaSim, sparseDOSSA, Splat, SPARSim, SPsimSeq, SparseDC, SymSim, ZINB-WaVE, zingeR, ... </g> --- ### Experimental Design Let's consider a microbiome case study: To block or not to block? * Blocking removes person-level effects... * ...but increases participant burden. <img src="figure/blocking_simplex.png"/> Figure from Sankaran and Holmes (2023). --- ### Simulation to the Rescue How can we navigate trade-offs like this? Simulate! .center[ <img src="figure/blocking.jpeg" width=840/> ] Simulators provide data for more precise decision-making. 
---

### Inference

By learning a discriminator to contrast real vs. simulated data, we can systematically improve the assumed generative mechanism (a minimal sketch appears in the appendix).

.center[
<img src="figure/iterability.jpeg" width=730/>
]

---

### Live Demo

Live Demo: https://go.wisc.edu/6hw050

Colab: https://go.wisc.edu/1lq21r

Quarto: https://go.wisc.edu/5met6d

Compiled: https://go.wisc.edu/g63hy9

---

### Discussion Prompts

1. What are some of the data analyses that you care about the most? What are the questions you are trying to answer, and do you think there might be a role for simulation in your analysis workflow?
1. What have been some of the bottlenecks or tradeoffs that you have encountered in managing, modeling, or drawing inferences from your data? How might simulation be used to navigate these issues?

---

## `scDesign` Mechanics

---

We will focus on the `scDesign` family of simulators. Let's review that package's approach.

.center[
<img src="figure/scdesignOverview.png" width=800/>
]

First, we estimate models `\(\hat{F}_{g}\left(y_{i} \vert \mathbf{x}_{i}\right)\)` for each gene `\(g\)`.

* Can use a variety of families: Gaussian, Poisson, Negative Binomial, ...
* Can learn relationships for each parameter `\(\theta\left(\mathbf{x}_{i}\right)\)`.

---

### `scDesign3` Review

.pull-left[
<img src="figure/scdesignOverview2.png" width=400/>
]

.pull-right[
1. Next, we model the joint distribution of quantiles using a copula model.
1. This correlates genes even after conditioning on the same `\(\mathbf{x}_{i}\)`.
]

---

### Copula Example

Here is a pair of uncorrelated Poisson samples. How can we make them correlated?

.pull-left[

```r
x <- matrix(rpois(100 * 2, 10), ncol = 2)
head(x)
```

```
##      [,1] [,2]
## [1,]   13   17
## [2,]   18   12
## [3,]   14    9
## [4,]    8   10
## [5,]   12   12
## [6,]   11   10
```
]

.pull-right[
<img src="20240207_files/figure-html/unnamed-chunk-3-1.png" width="420" />
]

---

### Copula Example

First, let's generate some correlated Gaussians.

.pull-left[

```r
library(mvtnorm)  # provides rmvnorm()

Sigma <- matrix(
  c(1, 0.95, 0.95, 1),
  nrow = 2
)

z <- rmvnorm(100, sigma = Sigma)
head(z)
```

```
##             [,1]       [,2]
## [1,]  0.07472686 -0.3382285
## [2,] -0.40734754 -0.4967710
## [3,] -1.32092449 -1.7778988
## [4,]  0.81134803  0.7444313
## [5,]  0.47832165  0.1779981
## [6,] -1.14137843 -1.3818025
```
]

.pull-right[
<img src="20240207_files/figure-html/unnamed-chunk-5-1.png" width="450" />
]

---

### Copula Example

Let's transform each marginal into uniform variates.

.pull-left[

```r
u <- pnorm(z)
head(u)
```

```
##            [,1]       [,2]
## [1,] 0.52978398 0.36759550
## [2,] 0.34187637 0.30967529
## [3,] 0.09326327 0.03771024
## [4,] 0.79141708 0.77169221
## [5,] 0.68378936 0.57063775
## [6,] 0.12685624 0.08351618
```
]

.pull-right[
<img src="20240207_files/figure-html/unnamed-chunk-7-1.png" width="450" />
]

---

### Copula Example

We can transform these uniforms back into Poisson variables. Now they are correlated. (The next slide ties the full recipe together.)

.pull-left[

```r
x_new <- qpois(u, lambda = 10)
head(x_new)
```

```
##      [,1] [,2]
## [1,]   10    9
## [2,]    9    8
## [3,]    6    5
## [4,]   12   12
## [5,]   11   10
## [6,]    6    6
```
]

.pull-right[
<img src="20240207_files/figure-html/unnamed-chunk-9-1.png" width="450" />
]
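---

### Sketch: The Recipe End to End

To connect the copula example back to the `scDesign3` recipe, here is a from-scratch sketch on a made-up three-gene count matrix. This is not the `scDesign3` implementation: the data, gene means, and negative binomial marginals are all illustrative placeholders, and the marginal fits come from `MASS::fitdistr` rather than covariate-dependent models.

```r
library(MASS)  # mvrnorm() and fitdistr()

set.seed(1)

# Toy "real" data: 200 cells x 3 genes of correlated counts. In practice this
# would be an assay matrix; there are no covariates x_i here, so the marginals
# are unconditional.
mu_g <- c(5, 20, 50)
z_true <- mvrnorm(200, mu = rep(0, 3), Sigma = 0.7 + 0.3 * diag(3))
y <- sapply(1:3, function(g) qnbinom(pnorm(z_true[, g]), size = 2, mu = mu_g[g]))

# 1. Fit a negative binomial marginal for each gene.
fits <- apply(y, 2, function(col) fitdistr(col, "negative binomial")$estimate)

# 2. Push each gene through its fitted CDF, then onto the normal scale.
#    (A more careful version would use randomized quantiles for discrete counts.)
u <- sapply(1:3, function(g) pnbinom(y[, g], size = fits["size", g], mu = fits["mu", g]))
z <- qnorm(pmin(pmax(u, 1e-6), 1 - 1e-6))

# 3. Estimate the Gaussian copula correlation across genes.
rho <- cor(z)

# 4. Simulate new cells: correlated normals -> uniforms -> fitted NB quantiles.
z_new <- mvrnorm(200, mu = rep(0, 3), Sigma = rho)
y_new <- sapply(1:3, function(g) qnbinom(pnorm(z_new[, g]), size = fits["size", g], mu = fits["mu", g]))

round(cor(y), 2)      # gene-gene correlation in the "real" counts
round(cor(y_new), 2)  # roughly preserved in the simulated counts
```

`scDesign3` carries out the same steps with covariate-dependent marginals `\(\hat{F}_{g}\left(y_{i} \vert \mathbf{x}_{i}\right)\)` and richer copula options.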
---

### References

Sankaran, Kris, and Susan P. Holmes. “Generative Models: An Interdisciplinary Perspective.” Annual Review of Statistics and Its Application 10 (2023).
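---

### Appendix: A Minimal Discriminator Check

The discriminator idea from the Inference slide can be prototyped in a few lines. This is a hypothetical sketch with made-up data: a plain logistic regression stands in for a more flexible classifier, and the "simulator" is deliberately misspecified so that there is something to detect.

```r
set.seed(2)

# Stand-ins for a real assay and a fitted simulator's output; the simulated
# second gene has the wrong mean on purpose.
real <- cbind(gene1 = rnbinom(300, size = 2, mu = 10), gene2 = rnbinom(300, size = 2, mu = 30))
sim  <- cbind(gene1 = rnbinom(300, size = 2, mu = 10), gene2 = rnbinom(300, size = 2, mu = 24))

dat <- data.frame(rbind(real, sim), label = rep(c(1, 0), each = 300))
train <- sample(nrow(dat), 300)

# Discriminator: classify real (1) vs. simulated (0) samples.
fit <- glm(label ~ gene1 + gene2, family = binomial, data = dat[train, ])
pred <- predict(fit, dat[-train, ], type = "response")

# Held-out accuracy near 0.5 means the simulator is hard to tell apart from the
# real data; accuracy clearly above 0.5 points to a gap in the generative mechanism.
mean((pred > 0.5) == dat$label[-train])
summary(fit)$coefficients  # a large |z value| flags the offending feature (here, gene2)
```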