1 Introduction
This website accompanies the review Semisynthetic Simulation for Microbiome Analysis. All examples discussed in the case studies there can be reproduced by running the code below, and we have also included additional introductory material about some packages that simplify our analysis.
1.1 Setup
The MIGsim
package below provides wrapper functions that we will use
throughout these book. You can install it by running:
devtools::install_github("krisrs1128/intro-to-simulation/MIGsim")
This block loads all packages that will be used in this book.
suppressPackageStartupMessages({
library(CovTools)
library(SpiecEasi)
library(SummarizedExperiment)
library(dplyr)
library(forcats)
library(gamboostLSS)
library(ggdist)
library(ggplot2)
library(glue)
library(mixOmics)
library(mutoss)
library(patchwork)
library(pwr)
library(scico)
library(splines)
library(tibble)
library(tidyr)
library(MIGsim)
library(purrr)
library(scDesigner)
library(gamlss)
library(mia)
library(wesanderson)
library(SimBench)
library(tidySummarizedExperiment)
library(ks)
library(stringr)
library(ANCOMBC)
library(edgeR)
library(ggrepel)
library(ggpubr)
})
theme_set(theme_classic() + theme(rect = element_rect(fill = "transparent")))
1.2 Using SummarizedExperiment
SummarizedExperiment
data structures simplify manipulation of sequencing
experiments, and we’ll be using them throughout these tutorials. For example,
they distinguish between molecule counts, which are stored in the assay
slot,
and sample descriptors, which are stored in colData
. At the same time, these
separate components are nicely synchronized. For example, subsetting samples
from one of these tables automatically subsets the other.
The line below loads a small subset of genera from the Atlas experiment, which profiled the gut microbiomes from 1006 healthy adults in Western Europe.
##
## Actinobacteria Bacteroidetes Firmicutes
## 1 2 21
mean(atlas$age)
## [1] 45.15629
## Allistipes et rel. Anaerostipes caccae et rel.
## 289.85263 123.84211
## Bacteroides vulgatus et rel. Bifidobacterium
## 1235.44211 120.72982
## Bryantella formatexigens et rel. Butyrivibrio crossotus et rel.
## 137.74737 188.54035
## Clostridium cellulosi et rel. Clostridium leptum et rel.
## 437.13684 129.88070
## Clostridium nexile et rel. Clostridium orbiscindens et rel.
## 73.69474 231.82807
## Clostridium sphenoides et rel. Clostridium symbiosum et rel.
## 139.35439 331.90175
## Coprococcus eutactus et rel. Dorea formicigenerans et rel.
## 222.15088 176.07368
## Lachnospira pectinoschiza et rel. Oscillospira guillermondii et rel.
## 127.75439 1560.27368
## Outgrouping clostridium cluster XIVa Ruminococcus bromii et rel.
## 91.93684 100.09123
## Ruminococcus callidus et rel. Ruminococcus obeum et rel.
## 103.61754 449.28421
## Sporobacter termitidis et rel. Subdoligranulum variable at rel.
## 423.71579 442.34737
## Uncultured Clostridiales I Uncultured Clostridiales II
## 191.28772 152.83509
Exercise: To practice working with SummarizedExperiment
objects, try answering:
- How many genera are available in this experiment object?
- What was the most common phylum in this dataset?
- What was the average participant age?
- What was the average abundance of
Allistipes et rel.
among people in theobese
BMI group?
Hint: The most important functions are assay()
, rowData()
, and colData()
.
Solution
nrow(atlas)
## [1] 24
##
## Actinobacteria Bacteroidetes Firmicutes
## 1 2 21
mean(atlas$age)
## [1] 45.15629
## Allistipes et rel. Anaerostipes caccae et rel.
## 289.85263 123.84211
## Bacteroides vulgatus et rel. Bifidobacterium
## 1235.44211 120.72982
## Bryantella formatexigens et rel. Butyrivibrio crossotus et rel.
## 137.74737 188.54035
## Clostridium cellulosi et rel. Clostridium leptum et rel.
## 437.13684 129.88070
## Clostridium nexile et rel. Clostridium orbiscindens et rel.
## 73.69474 231.82807
## Clostridium sphenoides et rel. Clostridium symbiosum et rel.
## 139.35439 331.90175
## Coprococcus eutactus et rel. Dorea formicigenerans et rel.
## 222.15088 176.07368
## Lachnospira pectinoschiza et rel. Oscillospira guillermondii et rel.
## 127.75439 1560.27368
## Outgrouping clostridium cluster XIVa Ruminococcus bromii et rel.
## 91.93684 100.09123
## Ruminococcus callidus et rel. Ruminococcus obeum et rel.
## 103.61754 449.28421
## Sporobacter termitidis et rel. Subdoligranulum variable at rel.
## 423.71579 442.34737
## Uncultured Clostridiales I Uncultured Clostridiales II
## 191.28772 152.83509
1.3 Warm-up: A Gaussian Example
Here’s a toy dataset to illustrate the main idea of GAMLSS. Each panel in the plot below represents a different feature (e.g., taxon, gene, metabolite…). The abundance varies smoothly over time, and in the first three panels, the trends differ by group assignment.
data(exper_ts)
exper_lineplot(exper_ts)
## Warning: Removed 12 rows containing missing values or values outside the scale range (`geom_point()`).
We can try to approximate these data with a new simulator. The setup_simulator
command takes the template SummarizedExperiment
object as its first argument.
The second gives an R formula syntax-style specification of GAMLSS parameters
(mean and SD, in this case) dependence on sample properties. The last argument
gives the type of model to fit, in this case, a Gaussian location-shape-scale
model.
sim <- setup_simulator(exper_ts, ~ ns(time, df = 7) * group, ~ GaussianLSS()) |>
estimate(nu = 0.01, mstop = 1000)
sample(sim) |>
exper_lineplot()
## Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).
Exercise: Right now, each panel allows for an interaction between the trend and group type. Can you define a simulator where the groups have no effect on the trends for the first two panels? This is the basis for defining synthetic negative controls.
sim <- sim |>
scDesigner::mutate(
1:2,
link = ~ ns(time, df = 7)
) |>
estimate(nu = 0.01, mstop = 1000)
sample(sim) |>
exper_lineplot()
Solution: We can modify the formula so that it no longer has an interaction
with group. We just need to remove the * group
from the original formula in our updated
link function. To ensure that this only applies to the first two panels, we use
1:2 in the first argument of mutate
. This first argument specifies which
features to apply the new formula to.
sim <- sim |>
scDesigner::mutate(1:2, link = ~ ns(time, df = 7)) |>
estimate(nu = 0.01, mstop = 1000)
sample(sim) |>
exper_lineplot()
## Warning: Removed 5 rows containing missing values or values outside the scale range (`geom_point()`).
## R version 4.4.1 Patched (2024-08-21 r87049)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] splines parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 ggrepel_0.9.6 edgeR_4.3.16
## [4] limma_3.61.10 ANCOMBC_2.7.0 stringr_1.5.1
## [7] ks_1.14.3 tidySummarizedExperiment_1.15.1 ttservice_0.4.1
## [10] SimBench_0.99.1 wesanderson_0.3.7 mia_1.13.36
## [13] MultiAssayExperiment_1.31.5 TreeSummarizedExperiment_2.13.0 Biostrings_2.73.1
## [16] XVector_0.45.0 SingleCellExperiment_1.27.2 gamlss_5.4-22
## [19] nlme_3.1-166 gamlss.dist_6.1-1 gamlss.data_6.0-6
## [22] scDesigner_0.0.0.9000 purrr_1.0.2 MIGsim_0.0.0.9000
## [25] tidyr_1.3.1 tibble_3.2.1 scico_1.5.0
## [28] pwr_1.3-0 patchwork_1.3.0 mutoss_0.1-13
## [31] mvtnorm_1.3-1 mixOmics_6.29.0 lattice_0.22-6
## [34] MASS_7.3-61 glue_1.8.0 ggplot2_3.5.1
## [37] ggdist_3.3.2 gamboostLSS_2.0-7 mboost_2.9-11
## [40] stabs_0.6-4 forcats_1.0.0 dplyr_1.1.4
## [43] SummarizedExperiment_1.35.1 Biobase_2.65.1 GenomicRanges_1.57.1
## [46] GenomeInfoDb_1.41.1 IRanges_2.39.2 S4Vectors_0.43.2
## [49] BiocGenerics_0.51.1 MatrixGenerics_1.17.0 matrixStats_1.4.1
## [52] SpiecEasi_1.1.3 CovTools_0.5.4
##
## loaded via a namespace (and not attached):
## [1] igraph_2.0.3 ica_1.0-3 plotly_4.10.4
## [4] Formula_1.2-5 scater_1.33.4 zlibbioc_1.51.1
## [7] tidyselect_1.2.1 bit_4.5.0 doParallel_1.0.17
## [10] pspline_1.0-20 blob_1.2.4 rngtools_1.5.2
## [13] S4Arrays_1.5.8 png_0.1-8 plotrix_3.8-4
## [16] cli_3.6.3 CVXR_1.0-14 pulsar_0.3.11
## [19] multtest_2.61.0 goftest_1.2-3 bluster_1.15.1
## [22] BiocNeighbors_1.99.1 uwot_0.2.2 mime_0.12
## [25] evaluate_1.0.0 tidytree_0.4.6 leiden_0.4.3.1
## [28] stringi_1.8.4 backports_1.5.0 lmerTest_3.1-3
## [31] Exact_3.3 gsl_2.1-8 httpuv_1.6.15
## [34] AnnotationDbi_1.67.0 magrittr_2.0.3 mclust_6.1.1
## [37] ADGofTest_0.3 doRNG_1.8.6 pcaPP_2.0-5
## [40] rootSolve_1.8.2.4 sctransform_0.4.1 lpSolve_5.6.21
## [43] ggbeeswarm_0.7.2 DBI_1.2.3 jquerylib_0.1.4
## [46] withr_3.0.1 corpcor_1.6.10 class_7.3-22
## [49] lmtest_0.9-40 htmlwidgets_1.6.4 fs_1.6.4
## [52] labeling_0.4.3 SparseArray_1.5.38 cellranger_1.1.0
## [55] DESeq2_1.45.3 extrafont_0.19 lmom_3.0
## [58] flare_1.7.0.1 reticulate_1.39.0 minpack.lm_1.2-4
## [61] geigen_2.3 zoo_1.8-12 knitr_1.48
## [64] nnls_1.5 UCSC.utils_1.1.0 SHT_0.1.8
## [67] decontam_1.25.0 foreach_1.5.2 fansi_1.0.6
## [70] grid_4.4.1 data.table_1.16.0 vegan_2.6-8
## [73] RSpectra_0.16-2 irlba_2.3.5.1 extrafontdb_1.0
## [76] DescTools_0.99.56 ellipsis_0.3.2 fastDummies_1.7.4
## [79] lazyeval_0.2.2 yaml_2.3.10 survival_3.7-0
## [82] scattermore_1.2 crayon_1.5.3 mediation_4.5.0
## [85] RcppAnnoy_0.0.22 RColorBrewer_1.1-3 progressr_0.14.0
## [88] later_1.3.2 ggridges_0.5.6 codetools_0.2-20
## [91] base64enc_0.1-3 Seurat_5.1.0 KEGGREST_1.45.1
## [94] Rtsne_0.17 shape_1.4.6.1 randtoolbox_2.0.4
## [97] foreign_0.8-87 pkgconfig_2.0.3 xml2_1.3.6
## [100] spatstat.univar_3.0-1 downlit_0.4.4 scatterplot3d_0.3-44
## [103] spatstat.sparse_3.1-0 ape_5.8 viridisLite_0.4.2
## [106] xtable_1.8-4 highr_0.11 car_3.1-2
## [109] fastcluster_1.2.6 plyr_1.8.9 httr_1.4.7
## [112] rbibutils_2.2.16 tools_4.4.1 globals_0.16.3
## [115] SeuratObject_5.0.2 kde1d_1.0.7 beeswarm_0.4.0
## [118] htmlTable_2.4.3 broom_1.0.6 checkmate_2.3.2
## [121] rgl_1.3.1 assertthat_0.2.1 lme4_1.1-35.5
## [124] rngWELL_0.10-9 digest_0.6.37 permute_0.9-7
## [127] numDeriv_2016.8-1.1 bookdown_0.40 Matrix_1.7-0
## [130] farver_2.1.2 reshape2_1.4.4 yulab.utils_0.1.7
## [133] viridis_0.6.5 DirichletMultinomial_1.47.0 rpart_4.1.23
## [136] cachem_1.1.0 polyclip_1.10-7 WGCNA_1.73
## [139] Hmisc_5.1-3 generics_0.1.3 dynamicTreeCut_1.63-1
## [142] parallelly_1.38.0 statmod_1.5.0 impute_1.79.0
## [145] huge_1.3.5 RcppHNSW_0.6.0 ScaledMatrix_1.13.0
## [148] carData_3.0-5 minqa_1.2.8 pbapply_1.7-2
## [151] glmnet_4.1-8 spam_2.10-0 utf8_1.2.4
## [154] gtools_3.9.5 shapes_1.2.7 readxl_1.4.3
## [157] preprocessCore_1.67.0 inum_1.0-5 ggsignif_0.6.4
## [160] energy_1.7-12 gridExtra_2.3 shiny_1.9.1
## [163] GenomeInfoDbData_1.2.12 memoise_2.0.1 rmarkdown_2.28
## [166] scales_1.3.0 gld_2.6.6 stabledist_0.7-2
## [169] partykit_1.2-22 future_1.34.0 RANN_2.6.2
## [172] spatstat.data_3.1-2 rstudioapi_0.16.0 cluster_2.1.6
## [175] spatstat.utils_3.1-0 hms_1.1.3 fitdistrplus_1.2-1
## [178] munsell_0.5.1 cowplot_1.1.3 colorspace_2.1-1
## [181] ellipse_0.5.0 rlang_1.1.4 quadprog_1.5-8
## [184] copula_1.1-4 DelayedMatrixStats_1.27.3 sparseMatrixStats_1.17.2
## [187] dotCall64_1.1-1 scuttle_1.15.4 mgcv_1.9-1
## [190] xfun_0.47 e1071_1.7-16 TH.data_1.1-2
## [193] iterators_1.0.14 rARPACK_0.11-0 abind_1.4-8
## [196] libcoin_1.0-10 treeio_1.29.1 gmp_0.7-5
## [199] DECIPHER_3.1.4 Rdpack_2.6.1 promises_1.3.0
## [202] RSQLite_2.3.7 sandwich_3.1-1 proxy_0.4-27
## [205] DelayedArray_0.31.11 Rmpfr_0.9-5 GO.db_3.20.0
## [208] compiler_4.4.1 prettyunits_1.2.0 boot_1.3-31
## [211] distributional_0.5.0 beachmat_2.21.6 rbiom_1.0.3
## [214] listenv_0.9.1 Rcpp_1.0.13 Rttf2pt1_1.3.12
## [217] BiocSingular_1.21.4 tensor_1.5 progress_1.2.3
## [220] BiocParallel_1.39.0 insight_0.20.5 spatstat.random_3.3-2
## [223] R6_2.5.1 fastmap_1.2.0 multcomp_1.4-26
## [226] rstatix_0.7.2 vipor_0.4.7 ROCR_1.0-11
## [229] rsvd_1.0.5 nnet_7.3-19 gtable_0.3.5
## [232] KernSmooth_2.23-24 miniUI_0.1.1.1 deldir_2.0-4
## [235] htmltools_0.5.8.1 ggthemes_5.1.0 RcppParallel_5.1.9
## [238] bit64_4.5.2 spatstat.explore_3.3-2 lifecycle_1.0.4
## [241] nloptr_2.1.1 sass_0.4.9 vctrs_0.6.5
## [244] VGAM_1.1-12 slam_0.1-53 spatstat.geom_3.3-3
## [247] sp_2.1-4 future.apply_1.11.2 pracma_2.4.4
## [250] rvinecopulib_0.6.3.1.1 bslib_0.8.0 pillar_1.9.0
## [253] locfit_1.5-9.10 jsonlite_1.8.9 expm_1.0-0