1 Introduction

This website accompanies the review Semisynthetic Simulation for Microbiome Analysis. All examples discussed in the case studies there can be reproduced by running the code below, and we have also included additional introductory material about some packages that simplify our analysis.

1.1 Setup

The MIGsim package below provides wrapper functions that we will use throughout these book. You can install it by running:

devtools::install_github("krisrs1128/intro-to-simulation/MIGsim")

This block loads all packages that will be used in this book.

1.2 Using SummarizedExperiment

SummarizedExperiment data structures simplify manipulation of sequencing experiments, and we’ll be using them throughout these tutorials. For example, they distinguish between molecule counts, which are stored in the assay slot, and sample descriptors, which are stored in colData. At the same time, these separate components are nicely synchronized. For example, subsetting samples from one of these tables automatically subsets the other.

The line below loads a small subset of genera from the Atlas experiment, which profiled the gut microbiomes from 1006 healthy adults in Western Europe.

data(atlas)
table(rowData(atlas)$Phylum)
## 
## Actinobacteria  Bacteroidetes     Firmicutes 
##              1              2             21
mean(atlas$age)
## [1] 45.15629
ix <- colData(atlas)$bmi_group == "obese"
abundances <- assay(atlas)
rowMeans(abundances[, ix])
##                   Allistipes et rel.          Anaerostipes caccae et rel. 
##                            289.85263                            123.84211 
##         Bacteroides vulgatus et rel.                      Bifidobacterium 
##                           1235.44211                            120.72982 
##     Bryantella formatexigens et rel.       Butyrivibrio crossotus et rel. 
##                            137.74737                            188.54035 
##        Clostridium cellulosi et rel.           Clostridium leptum et rel. 
##                            437.13684                            129.88070 
##           Clostridium nexile et rel.     Clostridium orbiscindens et rel. 
##                             73.69474                            231.82807 
##       Clostridium sphenoides et rel.        Clostridium symbiosum et rel. 
##                            139.35439                            331.90175 
##         Coprococcus eutactus et rel.        Dorea formicigenerans et rel. 
##                            222.15088                            176.07368 
##    Lachnospira pectinoschiza et rel.   Oscillospira guillermondii et rel. 
##                            127.75439                           1560.27368 
## Outgrouping clostridium cluster XIVa          Ruminococcus bromii et rel. 
##                             91.93684                            100.09123 
##        Ruminococcus callidus et rel.           Ruminococcus obeum et rel. 
##                            103.61754                            449.28421 
##       Sporobacter termitidis et rel.     Subdoligranulum variable at rel. 
##                            423.71579                            442.34737 
##           Uncultured Clostridiales I          Uncultured Clostridiales II 
##                            191.28772                            152.83509

Exercise: To practice working with SummarizedExperiment objects, try answering:

  • How many genera are available in this experiment object?
  • What was the most common phylum in this dataset?
  • What was the average participant age?
  • What was the average abundance of Allistipes et rel. among people in the obese BMI group?

Hint: The most important functions are assay(), rowData(), and colData().

Solution

nrow(atlas)
## [1] 24
table(rowData(atlas)$Phylum)
## 
## Actinobacteria  Bacteroidetes     Firmicutes 
##              1              2             21
mean(atlas$age)
## [1] 45.15629
atlas[, atlas$bmi_group == "obese"] |>
  assay() |>
  rowMeans()
##                   Allistipes et rel.          Anaerostipes caccae et rel. 
##                            289.85263                            123.84211 
##         Bacteroides vulgatus et rel.                      Bifidobacterium 
##                           1235.44211                            120.72982 
##     Bryantella formatexigens et rel.       Butyrivibrio crossotus et rel. 
##                            137.74737                            188.54035 
##        Clostridium cellulosi et rel.           Clostridium leptum et rel. 
##                            437.13684                            129.88070 
##           Clostridium nexile et rel.     Clostridium orbiscindens et rel. 
##                             73.69474                            231.82807 
##       Clostridium sphenoides et rel.        Clostridium symbiosum et rel. 
##                            139.35439                            331.90175 
##         Coprococcus eutactus et rel.        Dorea formicigenerans et rel. 
##                            222.15088                            176.07368 
##    Lachnospira pectinoschiza et rel.   Oscillospira guillermondii et rel. 
##                            127.75439                           1560.27368 
## Outgrouping clostridium cluster XIVa          Ruminococcus bromii et rel. 
##                             91.93684                            100.09123 
##        Ruminococcus callidus et rel.           Ruminococcus obeum et rel. 
##                            103.61754                            449.28421 
##       Sporobacter termitidis et rel.     Subdoligranulum variable at rel. 
##                            423.71579                            442.34737 
##           Uncultured Clostridiales I          Uncultured Clostridiales II 
##                            191.28772                            152.83509

1.3 Warm-up: A Gaussian Example

Here’s a toy dataset to illustrate the main idea of GAMLSS. Each panel in the plot below represents a different feature (e.g., taxon, gene, metabolite…). The abundance varies smoothly over time, and in the first three panels, the trends differ by group assignment.

data(exper_ts)
exper_lineplot(exper_ts)
## Warning: Removed 12 rows containing missing values or values outside the scale range (`geom_point()`).

We can try to approximate these data with a new simulator. The setup_simulator command takes the template SummarizedExperiment object as its first argument. The second gives an R formula syntax-style specification of GAMLSS parameters (mean and SD, in this case) dependence on sample properties. The last argument gives the type of model to fit, in this case, a Gaussian location-shape-scale model.

sim <- setup_simulator(exper_ts, ~ ns(time, df = 7) * group, ~ GaussianLSS()) |>
  estimate(nu = 0.01, mstop = 1000)

sample(sim) |>
  exper_lineplot()
## Warning: Removed 3 rows containing missing values or values outside the scale range (`geom_point()`).

Exercise: Right now, each panel allows for an interaction between the trend and group type. Can you define a simulator where the groups have no effect on the trends for the first two panels? This is the basis for defining synthetic negative controls.

sim <- sim |>
  scDesigner::mutate(
    1:2,
    link = ~ ns(time, df = 7)
  ) |>
  estimate(nu = 0.01, mstop = 1000)

sample(sim) |>
  exper_lineplot()

Solution: We can modify the formula so that it no longer has an interaction with group. We just need to remove the * group from the original formula in our updated link function. To ensure that this only applies to the first two panels, we use 1:2 in the first argument of mutate. This first argument specifies which features to apply the new formula to.

sim <- sim |>
  scDesigner::mutate(1:2, link = ~ ns(time, df = 7)) |>
  estimate(nu = 0.01, mstop = 1000)

sample(sim) |>
  exper_lineplot()
## Warning: Removed 5 rows containing missing values or values outside the scale range (`geom_point()`).
## R version 4.4.1 Patched (2024-08-21 r87049)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggpubr_0.6.0                    ggrepel_0.9.6                   edgeR_4.3.16                   
##  [4] limma_3.61.10                   ANCOMBC_2.7.0                   stringr_1.5.1                  
##  [7] ks_1.14.3                       tidySummarizedExperiment_1.15.1 ttservice_0.4.1                
## [10] SimBench_0.99.1                 wesanderson_0.3.7               mia_1.13.36                    
## [13] MultiAssayExperiment_1.31.5     TreeSummarizedExperiment_2.13.0 Biostrings_2.73.1              
## [16] XVector_0.45.0                  SingleCellExperiment_1.27.2     gamlss_5.4-22                  
## [19] nlme_3.1-166                    gamlss.dist_6.1-1               gamlss.data_6.0-6              
## [22] scDesigner_0.0.0.9000           purrr_1.0.2                     MIGsim_0.0.0.9000              
## [25] tidyr_1.3.1                     tibble_3.2.1                    scico_1.5.0                    
## [28] pwr_1.3-0                       patchwork_1.3.0                 mutoss_0.1-13                  
## [31] mvtnorm_1.3-1                   mixOmics_6.29.0                 lattice_0.22-6                 
## [34] MASS_7.3-61                     glue_1.8.0                      ggplot2_3.5.1                  
## [37] ggdist_3.3.2                    gamboostLSS_2.0-7               mboost_2.9-11                  
## [40] stabs_0.6-4                     forcats_1.0.0                   dplyr_1.1.4                    
## [43] SummarizedExperiment_1.35.1     Biobase_2.65.1                  GenomicRanges_1.57.1           
## [46] GenomeInfoDb_1.41.1             IRanges_2.39.2                  S4Vectors_0.43.2               
## [49] BiocGenerics_0.51.1             MatrixGenerics_1.17.0           matrixStats_1.4.1              
## [52] SpiecEasi_1.1.3                 CovTools_0.5.4                 
## 
## loaded via a namespace (and not attached):
##   [1] igraph_2.0.3                ica_1.0-3                   plotly_4.10.4              
##   [4] Formula_1.2-5               scater_1.33.4               zlibbioc_1.51.1            
##   [7] tidyselect_1.2.1            bit_4.5.0                   doParallel_1.0.17          
##  [10] pspline_1.0-20              blob_1.2.4                  rngtools_1.5.2             
##  [13] S4Arrays_1.5.8              png_0.1-8                   plotrix_3.8-4              
##  [16] cli_3.6.3                   CVXR_1.0-14                 pulsar_0.3.11              
##  [19] multtest_2.61.0             goftest_1.2-3               bluster_1.15.1             
##  [22] BiocNeighbors_1.99.1        uwot_0.2.2                  mime_0.12                  
##  [25] evaluate_1.0.0              tidytree_0.4.6              leiden_0.4.3.1             
##  [28] stringi_1.8.4               backports_1.5.0             lmerTest_3.1-3             
##  [31] Exact_3.3                   gsl_2.1-8                   httpuv_1.6.15              
##  [34] AnnotationDbi_1.67.0        magrittr_2.0.3              mclust_6.1.1               
##  [37] ADGofTest_0.3               doRNG_1.8.6                 pcaPP_2.0-5                
##  [40] rootSolve_1.8.2.4           sctransform_0.4.1           lpSolve_5.6.21             
##  [43] ggbeeswarm_0.7.2            DBI_1.2.3                   jquerylib_0.1.4            
##  [46] withr_3.0.1                 corpcor_1.6.10              class_7.3-22               
##  [49] lmtest_0.9-40               htmlwidgets_1.6.4           fs_1.6.4                   
##  [52] labeling_0.4.3              SparseArray_1.5.38          cellranger_1.1.0           
##  [55] DESeq2_1.45.3               extrafont_0.19              lmom_3.0                   
##  [58] flare_1.7.0.1               reticulate_1.39.0           minpack.lm_1.2-4           
##  [61] geigen_2.3                  zoo_1.8-12                  knitr_1.48                 
##  [64] nnls_1.5                    UCSC.utils_1.1.0            SHT_0.1.8                  
##  [67] decontam_1.25.0             foreach_1.5.2               fansi_1.0.6                
##  [70] grid_4.4.1                  data.table_1.16.0           vegan_2.6-8                
##  [73] RSpectra_0.16-2             irlba_2.3.5.1               extrafontdb_1.0            
##  [76] DescTools_0.99.56           ellipsis_0.3.2              fastDummies_1.7.4          
##  [79] lazyeval_0.2.2              yaml_2.3.10                 survival_3.7-0             
##  [82] scattermore_1.2             crayon_1.5.3                mediation_4.5.0            
##  [85] RcppAnnoy_0.0.22            RColorBrewer_1.1-3          progressr_0.14.0           
##  [88] later_1.3.2                 ggridges_0.5.6              codetools_0.2-20           
##  [91] base64enc_0.1-3             Seurat_5.1.0                KEGGREST_1.45.1            
##  [94] Rtsne_0.17                  shape_1.4.6.1               randtoolbox_2.0.4          
##  [97] foreign_0.8-87              pkgconfig_2.0.3             xml2_1.3.6                 
## [100] spatstat.univar_3.0-1       downlit_0.4.4               scatterplot3d_0.3-44       
## [103] spatstat.sparse_3.1-0       ape_5.8                     viridisLite_0.4.2          
## [106] xtable_1.8-4                highr_0.11                  car_3.1-2                  
## [109] fastcluster_1.2.6           plyr_1.8.9                  httr_1.4.7                 
## [112] rbibutils_2.2.16            tools_4.4.1                 globals_0.16.3             
## [115] SeuratObject_5.0.2          kde1d_1.0.7                 beeswarm_0.4.0             
## [118] htmlTable_2.4.3             broom_1.0.6                 checkmate_2.3.2            
## [121] rgl_1.3.1                   assertthat_0.2.1            lme4_1.1-35.5              
## [124] rngWELL_0.10-9              digest_0.6.37               permute_0.9-7              
## [127] numDeriv_2016.8-1.1         bookdown_0.40               Matrix_1.7-0               
## [130] farver_2.1.2                reshape2_1.4.4              yulab.utils_0.1.7          
## [133] viridis_0.6.5               DirichletMultinomial_1.47.0 rpart_4.1.23               
## [136] cachem_1.1.0                polyclip_1.10-7             WGCNA_1.73                 
## [139] Hmisc_5.1-3                 generics_0.1.3              dynamicTreeCut_1.63-1      
## [142] parallelly_1.38.0           statmod_1.5.0               impute_1.79.0              
## [145] huge_1.3.5                  RcppHNSW_0.6.0              ScaledMatrix_1.13.0        
## [148] carData_3.0-5               minqa_1.2.8                 pbapply_1.7-2              
## [151] glmnet_4.1-8                spam_2.10-0                 utf8_1.2.4                 
## [154] gtools_3.9.5                shapes_1.2.7                readxl_1.4.3               
## [157] preprocessCore_1.67.0       inum_1.0-5                  ggsignif_0.6.4             
## [160] energy_1.7-12               gridExtra_2.3               shiny_1.9.1                
## [163] GenomeInfoDbData_1.2.12     memoise_2.0.1               rmarkdown_2.28             
## [166] scales_1.3.0                gld_2.6.6                   stabledist_0.7-2           
## [169] partykit_1.2-22             future_1.34.0               RANN_2.6.2                 
## [172] spatstat.data_3.1-2         rstudioapi_0.16.0           cluster_2.1.6              
## [175] spatstat.utils_3.1-0        hms_1.1.3                   fitdistrplus_1.2-1         
## [178] munsell_0.5.1               cowplot_1.1.3               colorspace_2.1-1           
## [181] ellipse_0.5.0               rlang_1.1.4                 quadprog_1.5-8             
## [184] copula_1.1-4                DelayedMatrixStats_1.27.3   sparseMatrixStats_1.17.2   
## [187] dotCall64_1.1-1             scuttle_1.15.4              mgcv_1.9-1                 
## [190] xfun_0.47                   e1071_1.7-16                TH.data_1.1-2              
## [193] iterators_1.0.14            rARPACK_0.11-0              abind_1.4-8                
## [196] libcoin_1.0-10              treeio_1.29.1               gmp_0.7-5                  
## [199] DECIPHER_3.1.4              Rdpack_2.6.1                promises_1.3.0             
## [202] RSQLite_2.3.7               sandwich_3.1-1              proxy_0.4-27               
## [205] DelayedArray_0.31.11        Rmpfr_0.9-5                 GO.db_3.20.0               
## [208] compiler_4.4.1              prettyunits_1.2.0           boot_1.3-31                
## [211] distributional_0.5.0        beachmat_2.21.6             rbiom_1.0.3                
## [214] listenv_0.9.1               Rcpp_1.0.13                 Rttf2pt1_1.3.12            
## [217] BiocSingular_1.21.4         tensor_1.5                  progress_1.2.3             
## [220] BiocParallel_1.39.0         insight_0.20.5              spatstat.random_3.3-2      
## [223] R6_2.5.1                    fastmap_1.2.0               multcomp_1.4-26            
## [226] rstatix_0.7.2               vipor_0.4.7                 ROCR_1.0-11                
## [229] rsvd_1.0.5                  nnet_7.3-19                 gtable_0.3.5               
## [232] KernSmooth_2.23-24          miniUI_0.1.1.1              deldir_2.0-4               
## [235] htmltools_0.5.8.1           ggthemes_5.1.0              RcppParallel_5.1.9         
## [238] bit64_4.5.2                 spatstat.explore_3.3-2      lifecycle_1.0.4            
## [241] nloptr_2.1.1                sass_0.4.9                  vctrs_0.6.5                
## [244] VGAM_1.1-12                 slam_0.1-53                 spatstat.geom_3.3-3        
## [247] sp_2.1-4                    future.apply_1.11.2         pracma_2.4.4               
## [250] rvinecopulib_0.6.3.1.1      bslib_0.8.0                 pillar_1.9.0               
## [253] locfit_1.5-9.10             jsonlite_1.8.9              expm_1.0-0