Expressive Interfaces for Multi-Omics Simulation

class: title
background-size: cover

<div id="links">
Melbourne Integrative Genomics Seminar 
Slides: <a href="https://go.wisc.edu/ux3oc1">go.wisc.edu/ux3oc1</a>
</div>
<div id="title">
Expressive Interfaces for Multi-omics Simulation
</div>

As the program notes say, "Dive into the gene pool to see evolution in action," and enjoy "genetic engineering in the privacy of your own home."

Requires: 2.5 megabytes of system memory or 3 megabytes under System 7, and a hard drive.

-- The New York Times, 1992, reviewing the newly released SimLife

Heard melodies are sweet, but those unheard 
Are sweeter; therefore, ye soft pipes, play on... 
-- Keats

<div id="subtitle">
Kris Sankaran 
<a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> 
31 | May | 2024 
</div>

---

### Why Simulate?

* **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes.

* **Benchmarking**: Simulators allow us to benchmark algorithms even when ground truth labels are unavailable.

* **Data Augmentation**: Simulated samples can improve algorithmic performance and integration.

* **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate.

---

### Why Simulate?

* **Experimental Design**: We have to decide on cohorts, longitudinal sampling plans, and sequencing technologies, not to mention sample sizes.

* **Benchmarking**: Simulators allow us to benchmark algorithms even when ground truth labels are unavailable.

* **Data Augmentation**: Simulated samples can improve algorithmic performance and integration.

* **Calibration**: Statistical performance on simulated data can help us calibrate workflows to improve power and control the false discovery rate.

<g style="font-size: 20px; margin: 0; line-height: 30px; display: block;">
Examples: 
BEERS2,
compcodeR,
deconvR,
dropsim,
ESCO,
FreeHi-C,
FreeHiCLite,
hierarchicell,
kersplat,
metaART,
MIDAS,
Mimesys,
MOSim,
MSstatsSampleSize,
multiomics_networks_simulation,
muscat,
polyester,
powsimR,
POWSC,
SCDD,
scDesign3,
SCRIP,
Sim3C,
SimATAC,
SimFFPE,
sismonr,
spaSim,
sparseDOSSA,
SparseDC,
SPARSim,
SPsimSeq,
Splat,
SymSim,
ZINB-WaVE,
zingeR
</g>

---

### A Grand Challenge

*Many single-cell data analysis packages include their own ad hoc data simulators. However, these simulators are usually not available as separate tools or even as a source code, tailored to specific problems studied in corresponding papers and sometimes not comprehensively documented, thus limiting their utility for the broad research community.*

-- Challenge 11 from "Eleven grand challenges in single-cell data science" (Lähnemann et al., 2020).

---

### That's a Bit Uncharitable

1. It's true that most researchers do not routinely use simulation. The issue
isn't sloppiness in sharing or documenting code, though.

1. The deeper issues are:
  * Interfaces often suffer from large interaction gulfs (Hutchins et al., 1985). It is hard to specify simulators and iterate.

.center[
<img src="figures/interaction_gulfs.png" width=600/>
]

---

### That's a Bit Uncharitable

1. It's true that most researchers do not routinely use simulation. The issue
isn't sloppiness in sharing or documenting code, though.

1. The deeper issues are:
  * We usually view simulators as monolithic. We can
  adjust parameters but can't interchange components. This makes them hard to
  adapt to new technologies or applications.

.center[
<img src="figures/grammar_of_graphics.png" width=650/>

 
Modular decomposition of a plot according to the grammar of graphics (Wilkinson, 2005). 

]

---

### Existing Interfaces

This is from the [`splatter` introductory vignette](https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splatter.html#2_Quickstart) (Zappia et al., 2017). 
Of the packages I've listed, it has the most thoughtful interface.

``` r
params <- setParams(params, mean.shape = 0.5, de.prob = 0.2)
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (GENES) (Cells) [SEED] 
#> 8000 100 694289 
#> 
#> 29 additional parameters 
#> 
#> Batches: 
#> [Batches] [Batch Cells] [Location] [Scale] [Remove] 
#> 1 100 0.1 0.1 FALSE 
#> 
#> Mean: 
#> (RATE) (SHAPE) 
#> 0.5 0.5 
```

---

### Existing Interfaces

.pull-left[
Here is an example from [`scDesign3`](https://songdongyuan1994.github.io/scDesign3/docs/articles/scDesign3.html) (Song et al., 2023).
This package has the most generally applicable statistical methodology.
]

.pull-right[

``` r
example_simu <- scdesign3(
 sce = example_sce,
 assay_use = "counts",
 celltype = "cell_type",
 pseudotime = "pseudotime",
 spatial = NULL,
 other_covariates = NULL,
 mu_formula = "s(pseudotime, k = 10, bs = 'cr')",
 sigma_formula = "1", # To let the dispersion also vary along pseudotime, use "s(pseudotime, k = 5, bs = 'cr')"
 family_use = "nb",
 n_cores = 2,
 usebam = FALSE,
 corr_formula = "1",
 copula = "gaussian",
 DT = TRUE,
 pseudo_obs = FALSE,
 return_model = FALSE,
 nonzerovar = FALSE
 )
```
]

---

<g style="font-size: 36px">
Main Idea: Apply interactive computing principles to multi-omics simulation.
</g>

<img src="figures/Lego_Brick_4.svg" width=50/> Modularity: We should be able to build
problem-specific simulators by composing simple pieces.

<img src="figures/computer_mouse.png" width=35/> Interactivity: We should give domain
researchers agency in designing, evaluating, and modifying statistical
hypotheses.

<img src="figures/speech_bubbles.png" width=70/> Cooperation: Science is a social activity, and good
tools allow researchers to build upon one another's work.

---

### `scDesign3` Review

The rest of the talk will be about using these principles to develop a new
version of `scDesign3`. Let's review that package's approach.

.center[
<img src="figures/scdesignOverview.png" width=800/>
]

First, we estimate models `$\hat{F}_{g}\left(y_{ig} \vert \mathbf{x}_{i}\right)$` for each gene `$g$`.

* Can use a variety of families: Gaussian, Poisson, Negative Binomial,...
* Can learn relationships for each parameter `$\theta\left(\mathbf{x}_{i}\right)$`.

---
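
### Sketch: A Gene-Level Marginal

A minimal sketch of one gene-level marginal, assuming toy Poisson counts and a natural-spline mean. The simulated `pseudotime` and `y` are placeholders, and plain `glm` stands in for the richer LSS families; this is an illustration, not `scDesign3` code.

``` r
library(splines)  # ships with R; provides ns()

set.seed(1)
n <- 200
pseudotime <- runif(n)
# Toy counts for one gene whose mean rises along pseudotime (simulated)
y <- rpois(n, lambda = exp(0.5 + 2 * pseudotime))

# Poisson marginal with a natural-spline mean: log mu(x) = ns(pseudotime, 3)
fit <- glm(y ~ ns(pseudotime, 3), family = poisson())

# Fitted means define the estimated distribution of y given x for each cell
mu_hat <- fitted(fit)
head(ppois(y, mu_hat))  # fitted CDF evaluated at the observed counts
```

---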

### `scDesign3` Review

.pull-left[
<img src="figures/scdesignOverview2.png" width=400/>
]

.pull-right[
1. Next, we model the joint distribution of quantiles using a copula model.

1. This correlates genes even after conditioning on the same `$\mathbf{x}_{i}$`.
]

---
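
### Sketch: Gaussian Copula Sampling

One way to picture the copula step is the base-R sketch below. It assumes two toy genes, covariate-free Poisson marginals, and a Gaussian copula estimated from randomized quantiles. All names are illustrative, and this is not the package's implementation.

``` r
set.seed(1)
n <- 500
# Toy data: two count "genes" correlated through a latent factor (simulated)
latent <- rnorm(n)
y <- cbind(rpois(n, exp(1 + 0.5 * latent)), rpois(n, exp(0.5 + 0.5 * latent)))

# 1. Marginals: a covariate-free Poisson per gene, for brevity
lambda <- colMeans(y)

# 2. Randomized quantiles: draw uniformly within each count's CDF jump
u <- sapply(1:2, function(j) {
  runif(n, ppois(y[, j] - 1, lambda[j]), ppois(y[, j], lambda[j]))
})
u <- pmin(pmax(u, 1e-6), 1 - 1e-6)  # keep qnorm() finite

# 3. Gaussian copula: correlation of the normal scores
R <- cor(qnorm(u))

# 4. Simulate: correlated normals -> uniforms -> counts
z_new <- matrix(rnorm(n * 2), n, 2) %*% chol(R)
y_new <- sapply(1:2, function(j) qpois(pnorm(z_new[, j]), lambda[j]))
c(observed = cor(y)[1, 2], simulated = cor(y_new)[1, 2])
```

The simulated counts keep each gene's marginal while inheriting the gene-gene dependence from `R`.

---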

.section[
## Interface Design
]

---

### Nouns & Verbs

We can break the interface design question into two parts.

1. **Data Structures**: How can we represent the simulator in a way that is both
transparent to a human and precise enough for a computer?

1. **Operations**: How can we encourage users to study and tinker with the data
structures?

If the resulting grammar is expressive enough, then researchers will be able to
solve problems we may not have anticipated.

---

### Nouns: Gene-Level Models

For each gene, we can specify the regression formula and distributional family.

``` r
margins <- setup_margins(sce, ~ ns(pseudotime, 3), ~ ZINBLSS())
margins
```

<pre class="r-output"><code>## Plan:
## # A tibble: 6 × 3
## feature family link
## <gene_id> <distn> <link>
## 1 Pyy ZINBI [mu,sigma,nu] ~ns(pseudotime, 3)
## 2 Iapp ZINBI [mu,sigma,nu] ~ns(pseudotime, 3)
## 3 Chgb ZINBI [mu,sigma,nu] ~ns(pseudotime, 3)
## 4 Rbp4 ZINBI [mu,sigma,nu] ~ns(pseudotime, 3)
## 5 Spp1 ZINBI [mu,sigma,nu] ~ns(pseudotime, 3)
## 6 Chga ZINBI [mu,sigma,nu] ~ns(pseudotime, 3)
## Pyy, Iapp, Chgb, and 7 other features need fitting.
## Estimates:
## # A tibble: 0 × 0
</code></pre>

---

### Nouns: Copula Models

We can tie together a collection of marginals using a copula model.

.center[
 <img src="figures/gene-gene_dependence.png" width=900/>
]

---

### Extensibility

For the marginals, we can borrow from existing LSS packages (Rigby et al., 2005; Hofner et al., 2014) to implement gene-wise regressions:

* Gaussian, Gamma, Poisson, Binomial, Negative Binomial, Zero-Inflated Poisson, Zero-Inflated Negative Binomial,…

We can also build a unified interface to various copula estimation routines:

* Vine copulas for capturing higher-order moments.
* Gaussian and Student-t copulas with standard sample covariance, shrunken covariance, or graphical lasso precision estimates.

---

### Verbs: Plot

---

### Verbs: Mutate

`mutate` lets you modify a few elements from a larger simulator.

``` r
altered <- mutate(margins, Chga:Ins1, link = ~ pseudotime) |>
 estimate(sce)
```
<img src="20240531_files/figure-html/plot_alteration-1.png" width="800" style="display: block; margin: auto;" />

---
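
### Sketch: A Toy `mutate`

To make the verb concrete, here is a toy version. `plan` and `toy_mutate` are hypothetical stand-ins, not the package's data structures. The sketch only shows the pattern of editing a few rows of a larger specification and flagging them for refitting.

``` r
# Hypothetical stand-in for a simulator "plan": one row per gene
plan <- data.frame(
  feature = c("Pyy", "Iapp", "Chgb", "Chga"),
  family  = "ZINB",
  link    = "~ ns(pseudotime, 3)",
  stringsAsFactors = FALSE
)

# Toy verb: overwrite the formula for selected genes and flag them for refitting
toy_mutate <- function(plan, genes, link) {
  hit <- plan$feature %in% genes
  plan$link[hit] <- link
  plan$needs_fitting <- hit
  plan
}

altered <- toy_mutate(plan, c("Chgb", "Chga"), "~ pseudotime")
altered
```

---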

### Verbs: Mutate

1. Here is a more realistic example from a longitudinal microbiome study.
2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes.

.pull-three-quarters-left[
<img src="figures/nulls_unaltered.png"/>
]
.pull-three-quarters-right[
<img src="figures/pairwise_cors.png"/>
]

---

### Verbs: Mutate

1. Here is a more realistic example from a longitudinal microbiome study.
2. We can use `mutate` to define a synthetic null with no disease effect for a known subset of genes.

.pull-three-quarters-left[
<img src="figures/altered_ns.png"/>
]
.pull-three-quarters-right[
<img src="figures/pairwise_cors_altered.png"/>
]

---
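
### Sketch: A Synthetic Null

The synthetic null idea can be sketched with a single feature in base R. The data frame `d`, the effect sizes, and the use of `lm` are all assumptions made for illustration; the point is that dropping the disease term and resimulating gives data with no disease effect by construction.

``` r
set.seed(1)
n <- 120
d <- data.frame(time = rep(1:4, 30), disease = rep(c(0, 1), each = 60))
# Toy log-abundance with a true disease effect (simulated, not study data)
d$y <- 1 + 0.2 * d$time + 0.8 * d$disease + rnorm(n, sd = 0.5)

full <- y ~ time + disease
null <- update(full, . ~ . - disease)  # same model, disease term removed

# Simulate a synthetic null from the refitted reduced model
fit0 <- lm(null, data = d)
d$y_null <- fitted(fit0) + rnorm(n, sd = sigma(fit0))

# Disease coefficient in the original data vs. in the synthetic null
coef(lm(y ~ time + disease, data = d))["disease"]
coef(lm(y_null ~ time + disease, data = d))["disease"]
```

---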

### Verbs: Join

We should make it possible to combine simulators like Lego blocks.

``` r
experiments <- list(methylation = SCGEMMETH_sce, rna = SCGEMRNA_sce)
families <- list(~ BI(), ~ GaussianLSS())
sims <- experiments |>
 map2(families, \(x, y) setup_simulator(x, ~ cell_type, y))
```

---

### Verbs: Join (Copula)

One approach is to merge the list of marginal distributions and re-estimate the joint distribution.

``` r
sim_joined <- map(sims, estimate, nu = 0.1) |>
 join_copula(copula_glasso())
```

This assumes that we have samples where all features are measured.

.center[
<img src="figures/copula_join.png" width = 680/>
]

---
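
### Sketch: Joining by Copula

A rough base-R picture of the copula join, assuming paired toy assays on the same cells. Rank-based normal scores stand in for the fitted marginal CDFs, and simple linear shrinkage toward the identity is used here in place of the graphical lasso. The variable names are illustrative.

``` r
set.seed(1)
n <- 100
shared <- rnorm(n)
# Toy paired assays measured on the same n cells (simulated)
methylation <- sapply(1:3, function(j) rbinom(n, 1, plogis(shared)))
rna <- sapply(1:4, function(j) shared + rnorm(n))

# Normal scores from ranks: a stand-in for each fitted marginal CDF
normal_scores <- function(x) {
  qnorm(rank(x, ties.method = "random") / (length(x) + 1))
}
z <- apply(cbind(methylation, rna), 2, normal_scores)

# Joint copula correlation across both assays, shrunk toward the identity
alpha <- 0.2
R_joint <- (1 - alpha) * cor(z) + alpha * diag(ncol(z))
round(R_joint[1:3, 4:7], 2)  # cross-assay block
```

The cross-assay block is what the separate simulators could not capture on their own.

---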

### Verbs: Join (Conditioning)

Alternatively, we can combine two simulators by conditioning them on shared latent structure.

``` r
sim_joined <- join_pamona(sims)
```

<pre class="r-output"><code>## Pamona start!
## use random seed: 1
## Pamona Done! takes 1.464344 seconds
</code></pre>

.center[
<img src="figures/simulator_join.png" width = 800/>
]

---

### Verbs: Join (Conditioning)

This uses partial manifold alignment (Cao et al., 2020) to
learn shared latent variables across assays, and it works even in the diagonal
integration setting.

``` r
sim_joined
```

<pre class="r-output"><code>## $methylation
## [Marginals]
## Plan:
## # A tibble: 6 × 3
## feature family link
## <gene_id> <distn> <link>
## 1 AK123759 BI [mu] ~cell_type + UMAP1 + UMAP2
## 2 ADAM33 BI [mu] ~cell_type + UMAP1 + UMAP2
## 3 NFIX BI [mu] ~cell_type + UMAP1 + UMAP2
## 4 FOXD2 BI [mu] ~cell_type + UMAP1 + UMAP2
## 5 HLX.AS1 BI [mu] ~cell_type + UMAP1 + UMAP2
## 6 LY86.AS1 BI [mu] ~cell_type + UMAP1 + UMAP2
## AK123759, ADAM33, NFIX, and 24 other features need fitting.
## Estimates:
## # A tibble: 0 × 0
## 
## [Dependence]
## 0 NULLs with features
## 
## [Template Data]
## class: SingleCellExperiment 
## dim: 27 142 
## metadata(0):
## assays(1): counts
## rownames(27): AK123759 ADAM33 ... KCNQ2 CDH22
## rowData names(0):
## colnames(142): CellMeth1 CellMeth2 ... CellMeth141 CellMeth142
## colData names(10): X1 X2 ... UMAP1 UMAP2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## 
## $rna
## [Marginals]
## Plan:
## # A tibble: 6 × 3
## feature family link
## <gene_id> <distn> <link>
## 1 ZIC3 Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2
## 2 KCNQ2 Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2
## 3 ZFP42 Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2
## 4 OTX2 Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2
## 5 SALL4 Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2
## 6 NANOG Gaussian [mu,sigma] ~cell_type + UMAP1 + UMAP2
## ZIC3, KCNQ2, ZFP42, and 29 other features need fitting.
## Estimates:
## # A tibble: 0 × 0
## 
## [Dependence]
## 0 NULLs with features
## 
## [Template Data]
## class: SingleCellExperiment 
## dim: 32 177 
## metadata(0):
## assays(1): logcounts
## rownames(32): ZIC3 KCNQ2 ... NESTIN MYC
## rowData names(0):
## colnames(177): CellRNA1 CellRNA2 ... CellRNA176 CellRNA177
## colData names(10): X1 X2 ... UMAP1 UMAP2
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
</code></pre>

---
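
### Sketch: Conditioning on Shared Coordinates

The effect of the join on each formula can be mimicked with base R's `update`. This is only a sketch: the `links` list is a hypothetical stand-in for the per-assay plans, and the coordinates `UMAP1` and `UMAP2` are assumed to have been learned already by the alignment step.

``` r
# Per-assay regression formulas before the join (hypothetical stand-ins)
links <- list(methylation = ~ cell_type, rna = ~ cell_type)

# Add the shared latent coordinates to every assay's formula
shared <- ~ . + UMAP1 + UMAP2
joined <- lapply(links, function(f) update(f, shared))
joined
```

Because both assays now condition on the same coordinates, their simulated features become dependent through those coordinates.

---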

.section[
## Interactive Demo
]

.center[
 
 [go.wisc.edu/992c78](https://go.wisc.edu/992c78)

<img src="figures/demo_link.png" width=380/>
]

---

### Example 2: Power Analysis

1. With a given sample size and experimental design, what types of signals will be recoverable?

1. Even when we don't have access to analytical power formulas, we can draw
conclusions using computational experiments.

.center[
<img src="figures/concrete_power.png" width=700/>
]

---

### Problem Context

Gavin et al. (2018) studied the metaproteomics of Type 1 Diabetes (T1D).
.pull-left[
* They sampled participants at several disease stages (33 with T1D, 17 at high risk, 29 at low risk, and 22 healthy).
* Their analysis used sPLS-DA (Lê Cao et al., 2011), a sparse partial least squares variant of discriminant analysis.
]

.pull-right[
<img src="figures/gavin_t1d_summary.png" width=500/>
]

They found that intestinal and pancreatic function are altered even before the onset of full T1D.

---

### Multivariate Power Analysis

What would be a good power analysis for a follow-up study?

1. Estimation: Learn a simulator from the 101 available samples.
2. Perturbation: Generate data with varying sample sizes and signals.
3. Evaluation: Compute the average sPLS-DA performance across replicates.

.center[
<img src="figures/power_overview.png" width=700/>
]

---
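
### Sketch: A Simulation-Based Power Loop

The perturb-then-evaluate loop can be sketched in base R. Everything here is an assumption made for illustration: Gaussian noise replaces the fitted simulator, `one_rep` is a hypothetical helper, and per-protein t-tests are a univariate stand-in for sPLS-DA.

``` r
set.seed(1)
p <- 50  # proteins

# One replicate: generate data, add signal, and score recovery of true signals
one_rep <- function(n, n_signal, effect = 0.8) {
  group <- rep(0:1, each = n / 2)
  x <- matrix(rnorm(n * p), n, p)
  signal <- seq_len(n_signal)
  x[group == 1, signal] <- x[group == 1, signal] + effect  # perturbation
  pvals <- apply(x, 2, function(col) t.test(col ~ group)$p.value)
  mean(pvals[signal] < 0.05)  # fraction of true signals detected
}

# Evaluation: average over replicates for each sample size
grid <- expand.grid(n = c(20, 50, 100), n_signal = 10)
grid$power <- mapply(function(n, k) mean(replicate(50, one_rep(n, k))),
                     grid$n, grid$n_signal)
grid
```

---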

### Estimation

1. We fit Gaussian LSS models to log-ratio transformed measurements of the 427 proteins found in `$\geq$` 70% of subjects.
2. Means and variances are allowed to vary across low- and high-risk groups.
3. Main difficulty: The real data have long left tails.

.center[
<img src="figures/t1D-multivariate-0.png" width=800/>
]

---

### Perturbation

.pull-left[
* We `mutate` the simulators so that the true number of T1D-associated proteins is `$10, 30, \dots, 90$`
* We choose the true non-nulls from among proteins with observed strong effects
  - 32 are found significant at `$\alpha = 0.05$` in the original data
]

.pull-right[
<img src="figures/p_values.png"/>
]

---

### Evaluation

* The AUC for tuned sPLS-DA models varies from 0.5 (random guess) to over 0.9 in the high-signal regime.

* Potential Extension: Model patient heterogeneity (age, finer risk levels).

.center[
<img src="figures/t1D-multivariate-2.png" width=550/>
]

---
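
### Sketch: Computing AUC

For reference, AUC can be computed from ranks with no extra packages. The `auc` helper below is a sketch based on the rank-sum identity, and the scores are simulated rather than sPLS-DA output.

``` r
# AUC via the rank-sum (Mann-Whitney) identity; label is 0/1
auc <- function(score, label) {
  r <- rank(score)  # ties receive average ranks
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
label <- rep(0:1, each = 50)
auc(rnorm(100), label)              # no signal: expect a value near 0.5
auc(rnorm(100) + 2 * label, label)  # strong signal: expect a value well above 0.5
```

---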

### Interactive Plot

We have an experimental interactive visualization that can be used for model criticism.

.center[
<a href="http://localhost:5000">
<img src="figures/scdesigner-interactive.png" width=700/>
</a>
]

---

### History

Scott et al. (1954) is the earliest example I've found that uses
computer-simulated data to guide data analysis.

.pull-left[
 <img src="figures/galaxies-real.png" width=400/>
]

.pull-right[
 <img src="figures/galaxies-synthetic.png" width=400/>
]

---

### Upcoming Short Course

**Registration**: [go.wisc.edu/vsn1ix](https://go.wisc.edu/vsn1ix) (Free, but limited seats)

When: June 11, 18, 25 from 10am - 12pm

This hands-on short course will:

1. Review the computing and statistical concepts helpful for effective
simulation in biological applications.
1. Provide in-depth code examples of how to apply simulation to solve power
analysis and benchmarking problems.
1. Offer an opportunity to discuss simulation problems with peers and
experts.

---

### Upcoming Short Course

**Registration**: [go.wisc.edu/vsn1ix](https://go.wisc.edu/vsn1ix) (Free, but limited seats)

June 11 - Data Structures; Differential Testing and Simulation

June 18 - Multivariate Analysis and Simulation

June 25 - Visual Evaluation; Multi-omics Integration and Simulation
 
 
 
.center[
 <img src="figures/distributions.png" width=700/>
]

---

### Summary

1. Simulations support the design and comparison of complex workflows. This is
especially important for honest power analysis and benchmarking.

1. They encourage critical, quantitative thinking through computational
experiments.  They can serve as a gateway to advanced statistics.

1. Our interface makes these simulators transparent and adaptable.

---

### Thank you!

* Lab Members: Margaret Thairu, Hanying Jiang, Shuchen Yan, Mason Garza, Yuliang Peng, Kaiyan Ma, Kai Cui, and Kobe Uko
* Discussions with Saritha Kodikara (UniMelb), Kim-Anh Le Cao (UniMelb), Jingyi Jessica Li (UCLA), Andres Colubri (UMass)
* Funding: NIGMS R01GM152744.

---

### References

[1] K. Cao et al. "Manifold alignment for heterogeneous single-cell
multi-omics data integration using Pamona". In: _Bioinformatics_ 38
(2020), pp. 211 - 219.
<https://api.semanticscholar.org/CorpusID:226959190>.

[2] P. G. Gavin et al. "Intestinal Metaproteomics Reveals
Host-Microbiota Interactions in Subjects at Risk for Type 1 Diabetes".
In: _Diabetes Care_ 41.10 (Aug. 2018), p. 2178–2186. ISSN: 1935-5548.
DOI: 10.2337/dc18-0777. <http://dx.doi.org/10.2337/dc18-0777>.

[3] J. Heer. "Agency plus automation: Designing artificial intelligence
into interactive systems". In: _Proceedings of the National Academy of
Sciences_ 116.6 (Feb. 2019), p. 1844–1850. ISSN: 1091-6490. DOI:
10.1073/pnas.1807184115. <http://dx.doi.org/10.1073/pnas.1807184115>.

---

### References

[1] B. Hofner et al. "gamboostLSS: An R Package for Model Building and
Variable Selection in the GAMLSS Framework". In: _arXiv: Computation_
(2014). <https://api.semanticscholar.org/CorpusID:62302678>.

[2] D. Lähnemann et al. "Eleven grand challenges in single-cell data
science". In: _Genome Biology_ 21 (2020).
<https://api.semanticscholar.org/CorpusID:211054289>.

[3] K. Lê Cao et al. "Sparse PLS discriminant analysis: biologically
relevant feature selection and graphical displays for multiclass
problems". In: _BMC Bioinformatics_ 12.1 (Jun. 2011). ISSN: 1471-2105.
DOI: 10.1186/1471-2105-12-253.
<http://dx.doi.org/10.1186/1471-2105-12-253>.

---

### References

[1] R. A. Rigby et al. "Generalized additive models for location, scale
and shape". In: _Journal of the Royal Statistical Society: Series C
(Applied Statistics)_ 54 (2005).
<https://api.semanticscholar.org/CorpusID:56076930>.

[2] M. S. M. Sajjadi et al. "Assessing Generative Models via Precision
and Recall". In: _ArXiv_ abs/1806.00035 (2018).
<https://api.semanticscholar.org/CorpusID:44104089>.

[3] E. L. Scott et al. "Comparison of the Synthetic and Actual
Distribution of Galaxies on a Photographic Plate." In: _The
Astrophysical Journal_ 119 (Jan. 1954), p. 91. ISSN: 1538-4357. DOI:
10.1086/145799. <http://dx.doi.org/10.1086/145799>.

---

### References

[1] D. Song et al. "scDesign3 generates realistic in silico data for
multimodal single-cell and spatial omics". In: _Nature Biotechnology_
(2023), pp. 1-6. <https://api.semanticscholar.org/CorpusID:258638565>.

[2] L. Zappia et al. "Splatter: simulation of single-cell RNA
sequencing data". In: _Genome Biology_ 18 (2017).
<https://api.semanticscholar.org/CorpusID:9332625>.

---

### Verbs: Plot

``` r
sample(simulator) |>
  plot_correlations(points = point_opts)
```

---

### Verbs: Mutate

There are many reasons we might want to alter an initial simulator:
* **Synthetic Nulls**: We can define negative controls to safeguard against false discoveries.
* **Iteration**: We may want to improve the simulator’s goodness of fit by modifying the regression.
* **Power Analysis**: We may want to construct data from alternative experimental designs or biological signal strengths.

---

### Modularity - Augment

Another way to alter a simulator is to make implicit data explicitly available,
in the spirit of the broom package's `augment`.

For example, these data appear to have mouse-level effects.

.center[
<img src="figures/subject_effect_real.png" width=750/>
]

---

### Modularity - Augment

1. Unfortunately, none of the packages we use at the marginal estimation step support random effects.

1. We defined an `augment` function that provides a simple proxy:
  * Compute sample-wise low-dimensional multidimensional scaling coordinates `$z_{i}$`.
  * For mouse `$s$`, condition on its average coordinates: `$F_{g}\left(y_{ig} \vert x_{i}, \bar{z}_{s\left(i\right)}\right)$`
  * Use with caution: The `$z_{i}$` will be correlated with the diet effect.

.large-code[

``` r
exper <- augment_mds(exper, K = 4) |>
 groupwise_average(starts_with("MDS"), MouseID)
```
]

---

### Modularity - Augment

* Compute sample-wise low-dimensional multidimensional scaling coordinates `$z_{i}$`.
* For mouse `$s$`, condition on its average coordinates: `$F_{g}\left(y_{ig} \vert x_{i}, \bar{z}_{s\left(i\right)}\right)$`
* Use with caution: The `$z_{i}$` will be correlated with the diet effect.

.center[
<img src="figures/mouse_effect_comparison.png" width=1000/>
]

---
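
### Sketch: MDS Averages in Base R

The proxy can be approximated with base R alone. This sketch assumes toy data with a mouse-level shift and uses `cmdscale` and `ave` as stand-ins for the `augment_mds` and `groupwise_average` steps; it is not their implementation.

``` r
set.seed(1)
# Toy data: 5 mice x 4 samples each, with a mouse-level shift (simulated)
mouse <- rep(paste0("M", 1:5), each = 4)
shift <- rnorm(5)[as.integer(factor(mouse))]
x <- matrix(rnorm(20 * 10), 20, 10) + shift  # shift[i] is added to row i

# Sample-wise MDS coordinates
z <- cmdscale(dist(x), k = 2)
colnames(z) <- c("MDS1", "MDS2")

# Mouse-level averages, repeated for every sample from that mouse
z_bar <- apply(z, 2, function(col) ave(col, mouse))
head(data.frame(mouse, round(z_bar, 2)))
```

The columns of `z_bar` can then enter each gene's regression formula as ordinary covariates.

---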

### Example: Multi-omics Power Analysis

1. In practice, most power analyses are univariate, e.g., to guide differential expression.
2. How should we do multivariate power analysis, especially with several tables?

---

### Example

Let’s consider the microbiome + metabolome mouse study from Callahan et al. (2016), which had 12 samples.
1. How does estimation quality improve with sample size?
2. Is it worth gathering unpaired samples? (mosaic designs)

.center[
<img src="figures/cca_f1000.gif" width=650/>
]

---

### Part 1: Estimate Simulators

1. Fit ZINB models to the microbiome.
2. Fit Gaussian models to the metabolome.
3. Join them using highly regularized covariance estimates.

.center[
 <img src="figures/cca_microbiome_hist.png"/>
]

---

### Part 1: Estimate Simulators

1. Fit ZINB models to the microbiome.
2. Fit Gaussian models to the metabolome.
3. Join them using highly regularized covariance estimates.

.center[
 <img src="figures/cca_metab_hist.png"/>
]

---

### Part 2: Hypothetical Experiments
.pull-left[
1. Across sample sizes `$n$`, we simulate new data, fit sparse CCA, and identify a
basis `$\hat{V}_{n}$` spanning the `$K = 5$` left and right sparse CCA factors.
1. We use `$\hat{V}_{5000}$` as a reference and compute canonical angles (Golub et al., 1994) with all smaller sample sizes.
]

.pull-right[
<img src="figures/canonical_angles_paper.png"/>
Figure from Arikawa (2018).
]

---
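
### Sketch: Canonical Angles

Canonical angles between two subspaces can be computed with a QR step followed by an SVD. The helper name `canonical_angles` and the random bases below are assumptions for illustration; in the analysis, the inputs would be the estimated sparse CCA bases.

``` r
# Canonical (principal) angles between the column spaces of A and B
canonical_angles <- function(A, B) {
  Qa <- qr.Q(qr(A))  # orthonormal basis for span(A)
  Qb <- qr.Q(qr(B))  # orthonormal basis for span(B)
  s <- svd(crossprod(Qa, Qb))$d  # cosines of the angles
  acos(pmin(pmax(s, -1), 1))     # clip for numerical safety
}

set.seed(1)
V_ref <- matrix(rnorm(50 * 5), 50, 5)
V_noisy <- V_ref + 0.3 * matrix(rnorm(50 * 5), 50, 5)
canonical_angles(V_ref, V_ref)    # identical subspaces: angles near zero
canonical_angles(V_ref, V_noisy)  # perturbed basis: larger angles
```

---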

### Part 2: Hypothetical Experiments

With more samples, the canonical angles shrink predictably.

.center[
<img src="figures/canonical_angles.png" width=750/>
]

---

### Mosaic Case

Here, we allow the sample sizes to differ and use KNN imputation to run
CCA on the imputed table.

.center[
<img src="figures/mosaic_example.png" width=750/>
]

---

### Aside: Negative Control
The default sPLS-DA visualization output can separate classes even when no signals exist. This is a consequence of overfitting.

.center[
<img src="figures/t1D-null-overfit.png" width=400/>
]

---
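
### Sketch: Overfitting on Pure Noise

The same phenomenon can be reproduced with a much simpler procedure. This sketch assumes pure Gaussian noise and swaps sPLS-DA for naive top-feature selection, so it illustrates the overfitting mechanism rather than the method itself.

``` r
set.seed(1)
n <- 40
p <- 500
x <- matrix(rnorm(n * p), n, p)  # pure noise: no feature differs by class
y <- rep(0:1, each = n / 2)
train <- rep(c(TRUE, FALSE), n / 2)

# Pick the 10 features that best separate classes in the training half
gap <- colMeans(x[train & y == 1, ]) - colMeans(x[train & y == 0, ])
top <- order(abs(gap), decreasing = TRUE)[1:10]
score <- x[, top] %*% sign(gap[top])

# Apparent class separation on training samples vs. held-out samples
sep <- function(idx) mean(score[idx & y == 1]) - mean(score[idx & y == 0])
c(train = sep(train), test = sep(!train))
```

Selecting features and evaluating on the same samples makes noise look like signal; the held-out samples serve as the negative control.

---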

### Future Directions: Improving Generators

.pull-left[
GAMLSS models are simple and interpretable. However,

* They can oversmooth sharp transitions.
* Fitting each feature separately is both statistically and computationally inefficient.
]

.pull-right[
<img src="figures/spatial_example_smoothness.png"/>
]

---

### Future Directions: Quantitative Evaluation

.pull-left[
We need formal ways to gauge the quality of a simulator.

* Existing evaluation criteria for simulated omics data rely on ad hoc summaries.
* GAN precision-recall measures can offer a more systematic approach (Sajjadi et al., 2018).
]

.pull-right[
 <img src="figures/gan-eval2.png"/>
]

---

### Future Directions: Quantitative Evaluation

.pull-left[
We need formal ways to gauge the quality of a simulator.

* Existing evaluation criteria for simulated omics data rely on ad hoc summaries.
* GAN precision-recall measures can offer a more systematic approach (Sajjadi et al., 2018).
]

.pull-right[
 <img src="figures/gan-eval1.png"/>
]

---
### Future Directions: Autocompletion

* Many choices can influence simulation quality: preprocessing, distributional assumptions, covariates to include, and the spline basis.
* Can we "recommend" the next step of simulator construction, based on the user's initial specification (Heer, 2019)?

.center[
<img src="figures/wrangler.png" width=500/>
]