5 Batch Effect Correction

Integration can be a subtle exercise. We need to balance our interest in seeing similarities between datasets with the risk of making things seem more similar than they really are. Simulation can help navigate this subtlety by letting us see how integration methods would behave in situations where we know exactly how the different datasets are related. This note will illustrate this perspective by showing how simulation can help with both horizontal (across batches) and vertical (across assays) integration. We’ll have a brief interlude on the map function in the purrr, which is helpful for concisely writing code that would otherwise need for loops (e.g., over batches or assays).

5.1 Anaerobic Digestion Data

This example is about simultaneously analyzing several batches in a dataset about the efficiency of anaerobic digestion (AD) of organic matter. The essential problem is that, in this study, the samples could not be collected simultaneously. Small differences across separate runs could lead to systematic differences in the resulting data, which can obfuscate the more interesting between-group variation that the experiment was intended to uncover. For example, in the AD dataset, the date of the sequencing run has a global effect on measured community composition, which we can see right away from a principal components plot:

data(anaerobic)
pca_batch(anaerobic, facet = FALSE) +
  scale_color_manual(values = c("#191C59", "#bc0c3c")) +
  labs(
    col = "Treatment",
    shape = "Batch",
    x = "PC1 [25.2% Variance Explained]",
    y = "PC2 [16.2% Varianced Explained]"
  )

You can learn more about the general microbiome batch effect integration problem in (Wang and Le Cao, 2020), which is where this dataset example and the batch effect correction code below comes from. The article also reviews mechanisms that could lead to batch effects in microbiome data, together with methods for removing these effects and the situations within which they are most appropriate.

In batch effect correction, it’s important to remove as much of the batch variation as possible without accidentally also removing the real biological variation that would have been present even if all the samples had been sequenced together. This is sometimes called ``overintegration,’’ and this is an especially high risk if some of the real biological variation is quite subtle, e.g., a rare cell type or one that is very similar to a more prominent one. Simulation can help us gauge the extent to which different methods may or may not overintegrate. Since we get to control the between-batch and and between-biological-condition differences, we can see the extent to which integration methods can remove the former while preserving the latter.

5.2 Simulation Design

The block below estimates a candidate simulator. By using the formula ~ batch + treatment, we’re allowing for taxon-wise differences due to batch and treatment. Note that in principle, we could estimate an interaction between batch and treatment (the treatment could appear stronger in some batches than others). I encourage you to try estimating that model; however, visually analyzing the output suggests that this full model has a tendancy to overfit. Since the data have already been centered log-ratio transformed, we can try out a Gaussian marginal model. The AD dataset has relatively few samples compared to the number of features, so we’ll use a copula that’s designed for this setting.

simulator <- setup_simulator(
  anaerobic,
  ~ batch + treatment,
  ~ GaussianLSS(),
  copula = copula_adaptive(thr = .1)
) |>
  estimate(nu = 0.05, mstop = 100) # lower nu -> stable training

We can simulate from the fitted model and evaluate the quality of our fit using contrast_boxplot. This is a light wrapper of the ggplot2 code we used to compare experiments from our first session, and you can read its definition here.

anaerobic_sim <- sample(simulator)
contrast_boxplot(anaerobic, anaerobic_sim)

Exercise Propose and create at least one other visualization that can be used to compare and contrast the simulator with real data. What conclusions can you draw?

Solution: There are many possible answers:

  • Boxplots across taxa with different overall abundance levels.
  • Analogous histograms or CDF plots, to show the entire distributions, rather than just summarized quantiles.
  • Pair scatterplots, to see how well the bivariate relationships between taxa are preserved.
  • Dimensionality reduction on the simulated data, to see how well it matches global structure in the original data.

We’ll implement the last idea using PCA. This should be contrasted with the PCA plot on the original data above. It’s okay if the plot seems rotated relative to the oiginal plot – PCA is only unique up to rotation. The main characteristic we’re looking for is that the relative sizes of the batch and treatment effects seem reasonaly well-preserved, since these will be the types of effects that our later batch effect integration methods must be able to distinguish.

pca_batch(anaerobic_sim)

In addition to visual evaluation, we can also perform a narrow evaluation using the parameter estimations from RUV. The idea here is that the parameter estimation from RUV should be similar for simulation data and real data. In the RUV model, we estimate the treatment effect for each feature in simulation and real data sets and are plotted in the scatter plot along with the correlation value between the two. As seen in the figure there is good agreement with the parameter estimations from the two datasets with a correlation value of 0.66.

fit_sim <- ruv::RUV4(t(assay(anaerobic_sim)), colData(anaerobic_sim)$treatment, k = 0)
fit_org <- ruv::RUV4(t(assay(anaerobic)), colData(anaerobic)$treatment, k = 0)


bind_cols(org = c(fit_org$betahat), sim = c(fit_sim$betahat)) |>
  ggplot(aes(org, sim)) +
  geom_abline(color = "blue", lwd = 1.5, lty = 2) +
  geom_point(size = 2, shape = 16) +
  theme_classic() +
  ylab("Estimated beta values from simulated data") +
  xlab("Estimated beta values from real data") +
  stat_cor(method = "pearson", label.x = -1.4, label.y = 0.7)
:::

To study the risk for overintegration, let’s imagine that there were a third treatment group with relatively fewer samples. This is the type of group that a correction method might accidentally blend in with the rest, if it’s too aggressive. We’ve defined the imaginary experiment using the data.frame below. The treatment level 1.8 is the new one. We’ve supposed there are between 1 - 3 technical replicates (extraction) for each biological sample (sample), and the batch dates are the same as before.

data(imaginary_design)
summary(imaginary_design)
##    extraction        batch      treatment           rep           sample   
##  Min.   :1    09/04/2015:36   Min.   :0.0000   Min.   :1.00   1      :  2  
##  1st Qu.:1    14/04/2016:36   1st Qu.:0.0000   1st Qu.:1.75   2      :  2  
##  Median :2    01/07/2016:36   Median :1.0000   Median :2.50   3      :  2  
##  Mean   :2    14/11/2016:36   Mean   :0.7167   Mean   :2.75   4      :  2  
##  3rd Qu.:3    21/09/2017:36   3rd Qu.:1.0000   3rd Qu.:4.00   5      :  2  
##  Max.   :3                    Max.   :1.8000   Max.   :5.00   6      :  2  
##                                                               (Other):168

We can simulate from the new design and look at how different this new treatment group seems from the others. It’s a subtle effect, definitely smaller than the batch effect, but also separate enough that we should be able to preserve it.

p <- list()
anaerobic_sim <- sample(simulator, new_data = imaginary_design)
p[["sim"]] <- pca_batch(anaerobic_sim)

5.3 Benchmarking

We’ve defined a batch_correct wrapper function that implements either the RUV-III or ComBat batch effect correction methods. Their outputs are contrasted in the PCAs below. It looks like ComBat might be somewhat too aggressive, causing the 1 and 1.8 treatment groups to substantially overlap, while RUV is a bit more conservative, keeping the treatment groups nicely separate. As an aside, we note that this conclusion can depend on the number of replicates and total number of samples available. We’ve included the code for generating the imaginary_design data.frame in a vignette for the MIGsim package. Can you find settings that lead either method astray?

p[["ruv"]] <- pca_batch(batch_correct(anaerobic_sim, "ruv"))
p[["combat"]] <- pca_batch(batch_correct(anaerobic_sim, "combat"))
prediction_errors <- function(df) {
  y_hat <- predict(lda(treatment ~ PC1 + PC2, data = df), type = "response")$class
  table(df$treatment, y_hat)
}

map(p, ~ prediction_errors(.$data))
## $sim
##      y_hat
##        0  1 1.8
##   0   62 13   0
##   1   15 56   4
##   1.8  1 14  15
## 
## $ruv
##      y_hat
##        0  1 1.8
##   0   70  5   0
##   1    2 67   6
##   1.8  0  8  22
## 
## $combat
##      y_hat
##        0  1 1.8
##   0   65 10   0
##   1   12 59   4
##   1.8  1 12  17
## R version 4.4.1 Patched (2024-08-21 r87049)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
##  [1] splines   parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.3                 readr_2.1.5                     tidyverse_2.0.0                 doRNG_1.8.6                     rngtools_1.5.2                 
##  [6] foreach_1.5.2                   ggpubr_0.6.0                    ggrepel_0.9.6                   edgeR_4.3.16                    limma_3.61.10                  
## [11] ANCOMBC_2.7.0                   stringr_1.5.1                   ks_1.14.3                       tidySummarizedExperiment_1.15.1 ttservice_0.4.1                
## [16] SimBench_0.99.1                 wesanderson_0.3.7               mia_1.13.36                     MultiAssayExperiment_1.31.5     TreeSummarizedExperiment_2.13.0
## [21] Biostrings_2.73.1               XVector_0.45.0                  SingleCellExperiment_1.27.2     gamlss_5.4-22                   nlme_3.1-166                   
## [26] gamlss.dist_6.1-1               gamlss.data_6.0-6               scDesigner_0.0.0.9000           purrr_1.0.2                     MIGsim_0.0.0.9000              
## [31] tidyr_1.3.1                     tibble_3.2.1                    scico_1.5.0                     pwr_1.3-0                       patchwork_1.3.0                
## [36] mutoss_0.1-13                   mvtnorm_1.3-1                   mixOmics_6.29.0                 lattice_0.22-6                  MASS_7.3-61                    
## [41] glue_1.8.0                      ggplot2_3.5.1                   ggdist_3.3.2                    gamboostLSS_2.0-7               mboost_2.9-11                  
## [46] stabs_0.6-4                     forcats_1.0.0                   dplyr_1.1.4                     SummarizedExperiment_1.35.1     Biobase_2.65.1                 
## [51] GenomicRanges_1.57.1            GenomeInfoDb_1.41.1             IRanges_2.39.2                  S4Vectors_0.43.2                BiocGenerics_0.51.1            
## [56] MatrixGenerics_1.17.0           matrixStats_1.4.1               SpiecEasi_1.1.3                 CovTools_0.5.4                 
## 
## loaded via a namespace (and not attached):
##   [1] igraph_2.0.3                ica_1.0-3                   plotly_4.10.4               Formula_1.2-5               scater_1.33.4               zlibbioc_1.51.1            
##   [7] tidyselect_1.2.1            bit_4.5.0                   doParallel_1.0.17           pspline_1.0-20              blob_1.2.4                  S4Arrays_1.5.8             
##  [13] png_0.1-8                   plotrix_3.8-4               cli_3.6.3                   CVXR_1.0-14                 pulsar_0.3.11               arrayhelpers_1.1-0         
##  [19] multtest_2.61.0             goftest_1.2-3               textshaping_0.4.0           bluster_1.15.1              BiocNeighbors_1.99.1        uwot_0.2.2                 
##  [25] mime_0.12                   evaluate_1.0.0              tidytree_0.4.6              leiden_0.4.3.1              stringi_1.8.4               backports_1.5.0            
##  [31] XML_3.99-0.17               lmerTest_3.1-3              Exact_3.3                   gsl_2.1-8                   httpuv_1.6.15               AnnotationDbi_1.67.0       
##  [37] magrittr_2.0.3              mclust_6.1.1                ADGofTest_0.3               pcaPP_2.0-5                 rootSolve_1.8.2.4           sctransform_0.4.1          
##  [43] lpSolve_5.6.21              ggbeeswarm_0.7.2            DBI_1.2.3                   genefilter_1.87.0           jquerylib_0.1.4             withr_3.0.1                
##  [49] corpcor_1.6.10              systemfonts_1.1.0           class_7.3-22                lmtest_0.9-40               sva_3.53.0                  compositions_2.0-8         
##  [55] htmlwidgets_1.6.4           fs_1.6.4                    labeling_0.4.3              SparseArray_1.5.38          DEoptimR_1.1-3              cellranger_1.1.0           
##  [61] DESeq2_1.45.3               tidybayes_3.0.7             extrafont_0.19              annotate_1.83.0             lmom_3.0                    flare_1.7.0.1              
##  [67] reticulate_1.39.0           minpack.lm_1.2-4            geigen_2.3                  zoo_1.8-12                  knitr_1.48                  nnls_1.5                   
##  [73] UCSC.utils_1.1.0            svUnit_1.0.6                SHT_0.1.8                   timechange_0.3.0            decontam_1.25.0             fansi_1.0.6                
##  [79] grid_4.4.1                  rhdf5_2.49.0                data.table_1.16.0           ruv_0.9.7.1                 vegan_2.6-8                 RSpectra_0.16-2            
##  [85] irlba_2.3.5.1               extrafontdb_1.0             DescTools_0.99.56           ellipsis_0.3.2              fastDummies_1.7.4           ade4_1.7-22                
##  [91] lazyeval_0.2.2              yaml_2.3.10                 survival_3.7-0              scattermore_1.2             crayon_1.5.3                mediation_4.5.0            
##  [97] tensorA_0.36.2.1            phyloseq_1.49.0             RcppAnnoy_0.0.22            RColorBrewer_1.1-3          progressr_0.14.0            later_1.3.2                
## [103] ggridges_0.5.6              codetools_0.2-20            base64enc_0.1-3             Seurat_5.1.0                KEGGREST_1.45.1             Rtsne_0.17                 
## [109] shape_1.4.6.1               randtoolbox_2.0.4           foreign_0.8-87              pkgconfig_2.0.3             xml2_1.3.6                  spatstat.univar_3.0-1      
## [115] downlit_0.4.4               scatterplot3d_0.3-44        spatstat.sparse_3.1-0       ape_5.8                     viridisLite_0.4.2           xtable_1.8-4               
## [121] highr_0.11                  car_3.1-2                   fastcluster_1.2.6           plyr_1.8.9                  httr_1.4.7                  rbibutils_2.2.16           
## [127] tools_4.4.1                 globals_0.16.3              SeuratObject_5.0.2          kde1d_1.0.7                 beeswarm_0.4.0              htmlTable_2.4.3            
## [133] broom_1.0.6                 checkmate_2.3.2             rgl_1.3.1                   assertthat_0.2.1            lme4_1.1-35.5               rngWELL_0.10-9             
## [139] digest_0.6.37               bayesm_3.1-6                permute_0.9-7               numDeriv_2016.8-1.1         bookdown_0.40               Matrix_1.7-0               
## [145] tzdb_0.4.0                  farver_2.1.2                reshape2_1.4.4              yulab.utils_0.1.7           viridis_0.6.5               DirichletMultinomial_1.47.0
## [151] rpart_4.1.23                cachem_1.1.0                polyclip_1.10-7             WGCNA_1.73                  Hmisc_5.1-3                 generics_0.1.3             
## [157] dynamicTreeCut_1.63-1       parallelly_1.38.0           biomformat_1.33.0           statmod_1.5.0               impute_1.79.0               huge_1.3.5                 
## [163] ragg_1.3.3                  RcppHNSW_0.6.0              ScaledMatrix_1.13.0         carData_3.0-5               minqa_1.2.8                 pbapply_1.7-2              
## [169] glmnet_4.1-8                spam_2.10-0                 utf8_1.2.4                  gtools_3.9.5                shapes_1.2.7                readxl_1.4.3               
## [175] preprocessCore_1.67.0       inum_1.0-5                  ggsignif_0.6.4              energy_1.7-12               gridExtra_2.3               shiny_1.9.1                
## [181] GenomeInfoDbData_1.2.12     rhdf5filters_1.17.0         memoise_2.0.1               rmarkdown_2.28              scales_1.3.0                gld_2.6.6                  
## [187] stabledist_0.7-2            partykit_1.2-22             svglite_2.1.3               future_1.34.0               RANN_2.6.2                  spatstat.data_3.1-2        
## [193] rstudioapi_0.16.0           cluster_2.1.6               spatstat.utils_3.1-0        hms_1.1.3                   fitdistrplus_1.2-1          munsell_0.5.1              
## [199] cowplot_1.1.3               colorspace_2.1-1            ellipse_0.5.0               rlang_1.1.4                 quadprog_1.5-8              copula_1.1-4               
## [205] DelayedMatrixStats_1.27.3   sparseMatrixStats_1.17.2    dotCall64_1.1-1             scuttle_1.15.4              mgcv_1.9-1                  xfun_0.47                  
## [211] coda_0.19-4.1               e1071_1.7-16                TH.data_1.1-2               posterior_1.6.0             iterators_1.0.14            rARPACK_0.11-0             
## [217] abind_1.4-8                 libcoin_1.0-10              treeio_1.29.1               gmp_0.7-5                   Rhdf5lib_1.27.0             DECIPHER_3.1.4             
## [223] Rdpack_2.6.1                promises_1.3.0              RSQLite_2.3.7               sandwich_3.1-1              proxy_0.4-27                DelayedArray_0.31.11       
## [229] Rmpfr_0.9-5                 GO.db_3.20.0                compiler_4.4.1              prettyunits_1.2.0           boot_1.3-31                 distributional_0.5.0       
## [235] beachmat_2.21.6             rbiom_1.0.3                 listenv_0.9.1               Rcpp_1.0.13                 Rttf2pt1_1.3.12             BiocSingular_1.21.4        
## [241] tensor_1.5                  progress_1.2.3              BiocParallel_1.39.0         insight_0.20.5              spatstat.random_3.3-2       R6_2.5.1                   
## [247] fastmap_1.2.0               multcomp_1.4-26             rstatix_0.7.2               vipor_0.4.7                 ROCR_1.0-11                 rsvd_1.0.5                 
## [253] nnet_7.3-19                 gtable_0.3.5                KernSmooth_2.23-24          miniUI_0.1.1.1              deldir_2.0-4                htmltools_0.5.8.1          
## [259] ggthemes_5.1.0              RcppParallel_5.1.9          bit64_4.5.2                 spatstat.explore_3.3-2      lifecycle_1.0.4             nloptr_2.1.1               
## [265] sass_0.4.9                  vctrs_0.6.5                 robustbase_0.99-4           VGAM_1.1-12                 slam_0.1-53                 spatstat.geom_3.3-3        
## [271] sp_2.1-4                    future.apply_1.11.2         pracma_2.4.4                rvinecopulib_0.6.3.1.1      bslib_0.8.0                 pillar_1.9.0               
## [277] locfit_1.5-9.10             jsonlite_1.8.9              expm_1.0-0