4 Microbiome Networks

Unlike human social networks, there is no simple way to observe microbe-microbe interactions – we have to make do with indirect evidence. One approach uses population profiles as a proxy for ecological interaction. Taxa that often co-occur are understood to have cooperative ecological interactions, while those that don’t are thought to compete for the same niche.

Many algorithms have been designed around this intuition, all trying to go beyond simple co-occurrence and instead capture more complex types of dependence. A challenge in practice is that it’s hard to know which method to use when, since the problem is unsupervised. Even when thorough simulation benchmarking studies are available, it’s often not obvious how well those simulation setups match our problems of interest.

4.1 Estimation

Let’s use simulation to benchmark network estimation methods using data from rounds 1 and 2 of the American Gut Project. We will simulate data with known correlation structure and taxa-level marginals estimated from the study data. The block below reads in the data.

data(amgut)
amgut
## class: SummarizedExperiment 
## dim: 45 261 
## metadata(0):
## assays(1): counts
## rownames(45): 326792 181016 ... 364563 301645
## rowData names(0):
## colnames(261): 000001879.1076223 000002437.1076448 ... 000002656.1076186 000001582.1076447
## colData names(167): X.SampleID BarcodeSequence ... Description sequencing_depth

We’ve estimated a zero-inflated negative binomial location-shape-scale (ZINBLSS) models for each taxon, using a gaussian copula to capture dependence. We have used the regression formula ~log(sequencing_depth) + BMI. The data structure below captures all the simulator components, and we can swap pieces in and out to modify the form of the simulator. For example, if we wanted, we could mutate the family and link function associated with particular features.

sim <- setup_simulator(
  amgut,
  ~ log(sequencing_depth) + BMI,
  ~ ZINBLSS()
) |>
  estimate(mstop = 100)
sim
## [Marginals]
## Plan:
## # A tibble: 6 × 3
##     feature              family                         link
##   <gene_id>             <distn>                       <link>
## 1    326792 ZINBI [mu,sigma,nu] ~log(sequencing_depth) + BMI
## 2    181016 ZINBI [mu,sigma,nu] ~log(sequencing_depth) + BMI
## 3    191687 ZINBI [mu,sigma,nu] ~log(sequencing_depth) + BMI
## 4    326977 ZINBI [mu,sigma,nu] ~log(sequencing_depth) + BMI
## 5    194648 ZINBI [mu,sigma,nu] ~log(sequencing_depth) + BMI
## 6    541301 ZINBI [mu,sigma,nu] ~log(sequencing_depth) + BMI
## 
## Estimates:
## # A tibble: 3 × 2
##   feature fit       
##   <chr>   <list>    
## 1 326792  <glmbsLSS>
## 2 181016  <glmbsLSS>
## 3 191687  <glmbsLSS>
## ... and 42 additional features.
## 
## [Dependence]
## 1 normalCopula with 45 features
## 
## [Template Data]
## class: SummarizedExperiment 
## dim: 45 261 
## metadata(0):
## assays(1): counts
## rownames(45): 326792 181016 ... 364563 301645
## rowData names(0):
## colnames(261): 000001879.1076223 000002437.1076448 ... 000002656.1076186 000001582.1076447
## colData names(167): X.SampleID BarcodeSequence ... Description sequencing_depth

The simulated data is always a SummarizedExperiment. This means that any workflow that applied to the original data can be applied to the simulated one without any changes. Notice also that sample defaults to drawing samples from the same design as the original input experiment (we’ll modify this using the new_data argument in a minute).

simulated <- sample(sim)
simulated
## # A SummarizedExperiment-tibble abstraction: 11,745 × 170
## # Features=45 | Samples=261 | Assays=counts_1
##    .feature .sample    counts_1 X.SampleID BarcodeSequence LinkerPrimerSequence CARBOHYDRATE_PER LAST_TRAVEL NONFOODALLERGIES_BEE…¹ ASSIGNED_FROM_GEO RUN_DATE NONFOODALLERGIES_DRUG   AGE TOT_MASS
##    <chr>    <chr>         <dbl>      <dbl> <fct>           <fct>                           <dbl> <fct>       <fct>                  <fct>             <fct>    <fct>                 <dbl> <fct>   
##  1 326792   000001879…      257      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  2 181016   000001879…       32      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  3 191687   000001879…        4      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  4 326977   000001879…        0      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  5 194648   000001879…      206      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  6 541301   000001879…      254      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  7 353985   000001879…        0      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  8 90487    000001879…      129      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
##  9 352304   000001879…        1      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
## 10 191541   000001879…       17      1879. GAACAAAGAGCG    GTGTGCCAGCMGCCGCGGT…               23 More than … None                   n                 5/20/13  yes                      52 54      
## # ℹ 35 more rows
## # ℹ abbreviated name: ¹​NONFOODALLERGIES_BEESTINGS
## # ℹ 156 more variables: GENERAL_MEDS <fct>, BODY_SITE <fct>, MIGRAINE_FACTOR_2 <fct>, MIGRAINE_FACTOR_3 <fct>, TONSILS_REMOVED <fct>, MIGRAINE_FACTOR_1 <fct>, ELEVATION <fct>,
## #   PET_LOCATIONS <fct>, FLOSSING_FREQUENCY <fct>, PROTEIN_PER <fct>, CSECTION <fct>, RACE <fct>, COUNTRY_OF_BIRTH <fct>, MACRONUTRIENT_PCT_TOTAL <dbl>, COSMETICS_FREQUENCY <fct>,
## #   HOST_SUBJECT_ID <fct>, GLUTEN <fct>, ZIP <fct>, LONGITUDE <dbl>, PET_CONTACT <fct>, MIGRAINE_AGGRAVATION <fct>, DIABETES_MEDICATION <fct>, SAMPLE_PLATE <fct>, KEY_SEQ <fct>,
## #   BODY_PRODUCT <fct>, REGION <fct>, MIGRAINE_FREQUENCY <fct>, MIGRAINE_RELATIVES <fct>, PLANT_PER <dbl>, MIGRAINE_PHOTOPHOBIA <fct>, SITE_SAMPLED <fct>, CURRENT_RESIDENCE_DURATION <fct>,
## #   COUNTRY <fct>, ANTIBIOTIC_SELECT <fct>, MIGRAINE_NAUSEA <fct>, COMMON_NAME <fct>, FOODALLERGIES_TREENUTS <fct>, MIGRAINE_AURA <fct>, TARGET_GENE <fct>, SEQUENCING_METH <fct>, …

4.2 Evaluation

Let’s compare the marginal count distributions for the real and simulated data. We’ll need the data in “long” format to be able to make the ggplot2 figure. The pivot_experiment helper can transform the original SummarizedExperiment objects in this way. Notice that the simulated data tends to overestimate the number of zeros in the high-abundance taxa. To refine the simulator, we should probably replace the zero-inflated negative binomial with ordinary negative binomials for these poorly fitted taxa.

bind_rows(
  real = pivot_experiment(amgut),
  simulated = pivot_experiment(simulated),
  .id = "source"
) |>
  filter(feature %in% rownames(simulated)[1:20]) |>
  ggplot() +
  geom_histogram(
    aes(log(1 + value), fill = source),
    position = "identity", alpha = 0.8
  ) +
  facet_wrap(~ reorder(feature, value), scales = "free")

Are the learned relationships with BMI plausible? We can compare scatterplots of the real and simulated data against this variable. Note that, by default, the ribbons will be evaluated along all variables, which makes for the jagged ribbons (neighboring values for BMI might have different sequencing depth, potentially leading to quite different predictions). To remove this artifact, we can assume that all samples had exactly the same sequencing depth.

new_data <- colData(amgut) |>
  as_tibble() |>
  mutate(sequencing_depth = 2e4)
plot(sim, "BMI", sample(sim, new_data = new_data), new_data = new_data)

We next visualize the correlation matrix estimated by the simulator’s copula model. There are a few pairs of taxa that are very highly correlated, and there are also a few taxa that seem to have higher correlation across a large number of taxa (e.g., the taxon in row 34). There is no obvious banding or block structure in this real data, though.

rho <- copula_parameters(sim)
heatmap(rho)

The pair below is one of those with high positive correlation. You can replace the selection with the commented out line to see what one of the anticorrelated pairs of taxa looks like.

# taxa <- rownames(amgut)[c(33, 43)]
taxa <- rownames(amgut)[c(14, 25)]
pivot_experiment(amgut) |>
  filter(feature %in% taxa) |>
  pivot_wider(names_from = feature) |>
  ggplot() +
  geom_point(aes(log(1 + .data[[taxa[1]]]), log(1 + .data[[taxa[2]]])))

4.3 Block Covariance

Let’s replace the current copula correlation structure with one from a block diagonal matrix. In this example, the off-diagonal correlations are 0.6. We can use mutate_correlation to swap this new correlation matrix into our earlier simulator.

rho <- c(0.4, .6, 0.8) |>
  map(~ matrix(., nrow = 15, ncol = 15)) |>
  Matrix::bdiag() |>
  as.matrix()
diag(rho) <- 1

simulated <- sim |>
  mutate_correlation(rho) |>
  sample()

x <- t(assay(simulated))

Let’s first look at the SpiecEasi covariance estimate. This is a variant of the graphical lasso that is designed to be well-adapted to microbiome data. The good news is that it does warn that the default choices of \(\lambda\) are too large, which is correct in this case. Unfortunately, it took a while to get this answer, and we had already been quite generous in allowing it to fit 10 choices of \(\lambda\).

rho_se <- spiec.easi(x, nlambda = 10, pulsar.params = list(rep.num = 1)) |>
  getOptCov() |>
  as.matrix() |>
  cov2cor()
heatmap(rho_se)

Let’s instead use the Ledoit-Wolf estimator on the log-transformed data. The results make much more sense.

rho_lw <- CovEst.2003LW(log(1 + x))$S |>
  cov2cor()
heatmap(rho_lw)

Since color comparisons are difficult to evaluate precisely, we can also make a scatterplot comparing the different covariance estimators.

data.frame(truth = c(rho), se = c(rho_se), lw = c(rho_lw)) |>
  pivot_longer(-truth, values_to = "estimate") |>
  ggplot() +
  geom_jitter(aes(truth, estimate, col = name), alpha = 0.6, size = 0.4) +
  geom_abline(slope = 1) +
  facet_wrap(~name)

4.4 Generalization

What about other network structures? We can use the same logic to evaluate several network regimes. For example, the block below defines correlation matrices associated with scale-free and banded structures.

data(example_rho)
rho_hat <- list()
for (r in seq_along(example_rho)) {
  x <- sim |>
    mutate_correlation(example_rho[[r]]) |>
    scDesigner::sample() |>
    assay()

  rho_se <- spiec.easi(t(x), nlambda = 10, pulsar.params = list(rep.num = 1)) |>
    getOptCov() |>
    as.matrix() |>
    cov2cor()
  rho_lw <- CovEst.2003LW(log(1 + t(x)))$S |>
    cov2cor()
  rho_hat[[names(example_rho)[r]]] <- list(se = rho_se, lw = rho_lw)
}

This example shows that, when we start with real template data, it’s not too hard to setup a benchmarking experiment. It’s generally easier to reconfigure the components of an existing simulator than it is to specify all the simulation steps from scratch. There is the secondary bonus that the data tend to look close to real data of interest, at least up to the deliberate transformations needed to establish ground truth.

We could imagine extending this example to include different data properties (sample sizes, variable block sizes and correlations, more general correlation structure) and estimation strategies (alternative transformations or estimators). Design changes could be implemented using expand_colData, changes in the signal can be specified as above with mutate_correlation, and any workflow can be used as long as it applies to a SummarizedExperiment object.