New Diagnostics for Dimensionality Reduction of Genomic Data

Kris Sankaran
University of Wisconsin-Madison
University of Michigan
Biostatistics Seminar
2026-02-19

Dimensionality Reduction

Modern omics data have many features, and dimensionality reduction methods help create overview visualizations.

They can summarize microbiome community structure.

\(\def\Dir{\text{Dir}}\) \(\def\Mult{\text{Mult}}\) \(\def\*#1{\mathbf{#1}}\) \(\def\m#1{\boldsymbol{#1}}\) \(\def\Unif{\text{Unif}}\) \(\def\win{\tilde{w}_{\text{in}}}\) \(\def\reals{\mathbb{R}}\) \(\newcommand{\wout}{\tilde w_{\text{out}}}\)

They are also used in cell atlas construction and trajectory inference.

Challenges

These methods need to be used with caution:

  • The assumed number of latent communities can influence topic model interpretation.
  • UMAP can introduce visual artifacts, like spurious cell types.

Despite being widely used, there are few diagnostics for these dimensionality reduction methods.

Topic Alignment








Fukuyama, J., Sankaran, K., & Symul, L. (2023). Multiscale analysis of count data through topic alignment. Biostatistics (Oxford, England), 24(4), 1045–1065. doi:10.1093/biostatistics/kxac018

Model

Topic models suppose that samples \(x_i \in \mathbb{R}^{D}\) are drawn independently: \[\begin{align*} x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\ \gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right) \end{align*}\] where the columns \(\beta_{k}\) of \(\*B \in \Delta^{K}_{D}\) lie in the \(D\)-dimensional simplex and are drawn independently from, \[\begin{align*} \beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right). \end{align*}\]

Vertically stack the \(N\) \(\gamma_i\)’s into \(\Gamma \in \Delta^{N}_{K}\).
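The generative process above can be simulated directly. The following is a minimal sketch in plain Python (standard library only); the function names and the small default sizes are illustrative assumptions and are not part of any package discussed here.

```python
import random

def rdirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_j, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def simulate_topic_model(N=5, D=8, K=3, n_i=100,
                         lam_gamma=0.5, lam_beta=0.1, seed=0):
    """Simulate counts X (N x D), memberships Gamma (N x K), topics B (K x D)."""
    rng = random.Random(seed)
    # Topics: each beta_k is a point in the D-dimensional simplex.
    B = [rdirichlet([lam_beta] * D, rng) for _ in range(K)]
    Gamma, X = [], []
    for _ in range(N):
        gamma = rdirichlet([lam_gamma] * K, rng)
        # Feature probabilities: (B gamma)_d = sum_k beta_{kd} * gamma_k.
        probs = [sum(B[k][d] * gamma[k] for k in range(K)) for d in range(D)]
        # Multinomial draw of n_i counts over the D features.
        x = [0] * D
        for d in rng.choices(range(D), weights=probs, k=n_i):
            x[d] += 1
        Gamma.append(gamma)
        X.append(x)
    return X, Gamma, B
```

Each row of `X` sums to `n_i`, and each row of `Gamma` and `B` sums to one, matching the simplex constraints in the model.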

Microbiome Analogy

  • Topics: \(\beta_{k}\) describes \(K\) underlying community types.
  • Memberships: \(\gamma_{i}\) describes sample \(x_i\) as a mixture of community types.

Example: Antibiotics Time Course

In a topic model fit to these data, Topic 2 increases during the antibiotic interventions, especially the first (Sankaran and Holmes 2018).

Choice of \(K\)

However, we stress that care should be taken in the interpretation of the inferred value of \(K\). To begin with, due to the very high dimensionality of the parameter space, we found it difficult to obtain reliable estimates of \(P\left(X \vert K\right)\)… There are also biological reasons to be careful interpreting \(K\).

– From (Pritchard, Stephens, and Donnelly 2000).

In practice, models are fit across many values of \(K\) and their goodness of fit is compared (Novembre 2016; Wallach et al. 2009; Lawson, Dorp, and Falush 2018).

alto: Main Idea

We fit models of varying complexity \(K\) and build a compact representation of the ensemble.

Columns are models and rectangles are topics.

Alignment as a Graph

An alignment is a graph where nodes \(V\) represent topics \(\beta\left(v\right) \in \Delta^D\) and memberships \(\Gamma\left(v\right) = \left(\gamma_{ik}^{m}\right)_{i=1}^{N} \in \reals^{N}\) across models.

  • Edges \(E\) connect topics across complexities \(K \to K + 1\)
  • Weights \(W\) measure topic similarity

Estimating Weights

Let \(V_p\) and \(V_q\) be two subsets of topics within the graph.

  • Let the total “mass” of \(V_p\) be \(p = \left\{\Gamma\left(v\right)^T 1 : v \in V_{p}\right\}\), and define \(q\) analogously for \(V_q\).
  • Define the transport cost \(C\left(v, v^\prime\right) := JSD\left(\beta\left(v\right), \beta\left(v^\prime\right)\right)\), the Jensen-Shannon divergence between topics (Peyré and Cuturi 2018).
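The transport cost can be sketched in a few lines. This is a minimal illustration using natural logarithms; the helper names `kl`, `jsd`, and `cost_matrix` are assumptions made for the example.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two points in the simplex."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cost_matrix(topics_p, topics_q):
    """C[v][v'] = JSD(beta(v), beta(v')) for topics in V_p (rows) and V_q (columns)."""
    return [[jsd(b, b2) for b2 in topics_q] for b in topics_p]
```

With natural logarithms the divergence lies between 0 (identical topics) and \(\log 2\) (topics with disjoint support), so the entries of \(C\) are bounded.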

Estimating Weights

The weights \(W\) can be learned using optimal transport, \[\begin{align*} &\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right> \end{align*}\] \[\begin{align*} \mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}. \end{align*}\]
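For intuition about the feasible set \(\mathcal{U}\left(p, q\right)\), the sketch below approximates the transport plan with Sinkhorn iterations. Note the swap: this solves an entropically regularized version of the problem rather than the exact linear program above, and it is not claimed to be the solver used in alto. The function name, the regularization strength `reg`, and the iteration count are assumptions.

```python
import math

def sinkhorn(C, p, q, reg=0.2, n_iter=500):
    """Approximate the minimizer of <C, W> over U(p, q) with an entropic penalty.

    C is a |V_p| x |V_q| cost matrix; p and q are masses with equal totals.
    """
    m, n = len(p), len(q)
    K = [[math.exp(-C[i][j] / reg) for j in range(n)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * n
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(m)]
```

The returned matrix is nonnegative, its row sums approximate \(p\), and its column sums approximate \(q\). Smaller values of `reg` track the exact optimal transport solution more closely but converge more slowly.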

Path-based Diagnostics

  • Paths: Partition the Sankey diagram into connected sets of topics.
  • Coherence: Average similarity of \(\beta\left(v\right)\) to other topics along the same path. High coherence \(\to\) persistent structure.
  • Refinement: Mixing between descendants. High refinement \(\to\) genuine increase in complexity.

Examples: True Model

Below is the alignment for data from a topic model. Can you guess \(K\)?

  • \(N = 250, D = 1000, \lambda_{\gamma} = 0.5, \lambda_{\beta} = 0.1\)

Examples: True Model

The diagnostics suggest that the true \(K\) is 5.

Data Analysis Background

Ravel et al. (2010) used clustering to identify 5 Community State Types (CSTs) in the vaginal microbiome.

Ravel et al. (2010) grouped samples (columns) into CSTs.

Deconstructing CSTs

The follow-up study (Symul et al. 2023) had more samples than Ravel et al. (2010) and so could identify additional structure lying behind CSTs.

  • They had 2179 samples from 135 women, sampled longitudinally.
  • The green and blue paths to the right reflect the known Lactobacillus CSTs.

Coherence Scores

Coherence is not a function of \(K\) alone.

Software

Topic alignment is implemented in the R package alto.

library(purrr)
library(alto)

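# Define LDA parameters: one model for each number of topics K = 1, ..., 10
params <- map(
  set_names(1:10),
  ~ list(k = .)
)
# Fit the ensemble of LDA models to the count matrix
models <- run_lda_models(
  vm_data$counts,
  params
)
# Run alignment, estimating weights with optimal transport
result <- align_topics(
  models,
  method = "transport"
)
# Visualize the alignment across models
plot(result)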

All the simulations discussed today are vignettes in the package.

Distortion Visualization








Sankaran, K., Zhang, S., Chenab, & Meilă, M. (2025). Interactive Visualization of Metric Distortion in Nonlinear Data Embeddings using the distortions Package. doi:10.1101/2025.08.21.671523

Distortions in \(t\)-SNE and UMAP

Both \(t\)-SNE and UMAP introduce distortions. For example, they may not preserve density within different regions of the plot.

Example from (Narayan, Berger, and Cho 2021).

Distortions in \(t\)-SNE and UMAP

They can also fail to preserve the topology of the underlying data…

Example from (Kobak and Linderman 2021).

Consequences

These distortions are not mere technical curiosities – they significantly impact scientific interpretation (Liu, Ma, and Zhong 2025; Kobak and Linderman 2021). For example, they create misleading differences between cell types that are actually similar.

Example from (Xia, Lee, and Li 2024).

Controversy

See also (Kozlov 2024; Irizarry 2024).

Approach

Rather than abandoning nonlinear dimensionality reduction, we augment the embeddings to characterize distortion.

This is a high-dimensional version of Tissot’s indicatrix from cartography (Laskowski 1989).

RMetric Motivation

The RMetric algorithm (Perrault-Joncas and Meila 2006; McQueen et al. 2016) quantifies distortion geometrically. To motivate the algorithm, consider the distortion induced by mapping the sphere into latitude/longitude coordinates.

Half-Sphere Parameterization

Parameterize points on \(\mathcal{M}\) using spherical coordinates:

\[\begin{align*} \mathbf{x}\left(p\right) = \left(\cos\varphi\cos\theta, \cos\varphi \sin\theta, \sin\varphi\right) \end{align*}\]

The associated (latitude, longitude) embedding is

\[\begin{align*} \mathbf{z}\left(p\right) = \left(\theta\left(p\right), \varphi\left(p\right)\right). \end{align*}\]

Pushforward Metric

What does a small step in the embedding space correspond to in \(\mathcal{M}\)? The pushforward metric answers this,

\[\begin{align*} g_{ij} = \left\langle \frac{\partial\mathbf{x}}{\partial z^i}, \frac{\partial\mathbf{x}}{\partial z^j}\right\rangle_{\mathbb{R}^3} \end{align*}\]

This varies across \(p \in \mathcal{M}\) but we suppress it from the notation.

Pushforward Metric

In the sphere example, these derivatives can be computed directly to obtain

\[\begin{align*} G = \begin{pmatrix} \cos^2\varphi & 0 \\ 0 & 1 \end{pmatrix} \end{align*}\] Near the equator (\(\varphi \approx 0\)), a small step in \(\theta\) covers more distance than near the north pole (\(\varphi \approx \frac{\pi}{2}\)).
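The matrix \(G\) can be checked numerically. The sketch below approximates the partial derivatives of \(\mathbf{x}\) with central finite differences and forms their inner products; the function names and step size are assumptions for illustration.

```python
import math

def x(theta, phi):
    """Point on the unit sphere at longitude theta and latitude phi."""
    return (math.cos(phi) * math.cos(theta),
            math.cos(phi) * math.sin(theta),
            math.sin(phi))

def pushforward_metric(theta, phi, eps=1e-6):
    """Finite-difference estimate of g_ij = <dx/dz^i, dx/dz^j>, with z = (theta, phi)."""
    def partial(i):
        step = [0.0, 0.0]
        step[i] = eps
        a = x(theta + step[0], phi + step[1])
        b = x(theta - step[0], phi - step[1])
        return [(ai - bi) / (2 * eps) for ai, bi in zip(a, b)]
    J = [partial(0), partial(1)]
    return [[sum(u * v for u, v in zip(J[i], J[j])) for j in range(2)]
            for i in range(2)]
```

Evaluating `pushforward_metric` at latitudes approaching \(\pi/2\) shows the first diagonal entry shrinking toward zero while the second stays near one, in agreement with the closed form.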

Dual Pushforward Metric

Alternatively, the gradients \(\nabla z^i\) of the embedding dimensions also reflect distortion.

Since the level sets of \(z^\theta\) become more compressed near the poles, the gradients \(\nabla z^\theta\) become larger there.

Dual Pushforward Metric

This gradient information can be stored in the matrix \(H\) with elements,

\[\begin{align*} h^{ij} = \langle \nabla z^i, \nabla z^j \rangle_{g_{0}} \end{align*}\] where \(g_{0}\) is the metric on \(\mathcal{M}\) inherited from the ambient space.



\(H\) is computable from data while \(G\) requires an explicit manifold parameterization.

Dual Pushforward Metric

In our running example,

\[\begin{align*} H = \begin{pmatrix} 1/\cos^2\varphi & 0 \\ 0 & 1 \end{pmatrix} \end{align*}\]

We can see that \(H = G^{-1}\), and this identity in fact holds more generally.

Product Rule for Laplacians

For any \(f\) and \(g\), the Laplacian \(\Delta\) satisfies, \[\begin{align*} \Delta(fg) = f\,\Delta g + g\,\Delta f + 2\langle \nabla f, \nabla g \rangle_{g_{0}} \end{align*}\]

Setting \(f = z^i\), \(g = z^j\) and rearranging, \[\begin{align*} h^{ij} = \langle \nabla z^i, \nabla z^j \rangle_{g_0} = \frac{1}{2}\left[\Delta(z^i z^j) - z^i\Delta z^j - z^j\Delta z^i\right] \end{align*}\]

Since there are methods for estimating \(\Delta\) from data (Hein, Audibert, and Luxburg 2005; Coifman and Lafon 2006), we also have a practical method for approximating local distortions \(H\)!

Implementation

Let \(z_{k} \in \mathbb{R}^{N}\) be the \(k^{th}\) embedding dimension. Let \(L\) be the doubly-normalized graph Laplacian (Coifman and Lafon 2006). Compute

\[\begin{align*} H_{kk'}^{(\cdot)} := \frac{1}{2}\left[L\left(z_{k} \circ z_{k'}\right) - z_{k} \circ \left(L z_{k'}\right) - z_{k'} \circ \left(L z_{k}\right) \right] \in \mathbb{R}^{N} \end{align*}\]

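The embedding distortion for sample \(n\) is given by \(H^{(n)} \in \mathbb{R}^{K \times K}\), whose \(\left(k, k'\right)\) entry is the \(n^{th}\) coordinate of \(H_{kk'}^{(\cdot)}\). Here \(K\) denotes the number of embedding dimensions.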
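The computation above can be sketched with nested lists. This is an illustration, not the distortions package API: the function names are hypothetical, and the usage note assumes a Laplacian whose rows sum to zero (for example \(L = P - I\) for a row-stochastic \(P\)).

```python
def matvec(L, v):
    """Multiply an N x N matrix (list of rows) by a length-N vector."""
    return [sum(l_ij * v_j for l_ij, v_j in zip(row, v)) for row in L]

def local_metrics(L, Z):
    """Return a list of N matrices H^(n), each K x K.

    L: N x N graph Laplacian; Z: N x K embedding coordinates.
    """
    N, K = len(Z), len(Z[0])
    cols = [[Z[n][k] for n in range(N)] for k in range(K)]  # z_k in R^N
    Lz = [matvec(L, c) for c in cols]                       # L z_k
    H = [[[0.0] * K for _ in range(K)] for _ in range(N)]
    for k in range(K):
        for kp in range(k, K):
            # L applied to the elementwise product z_k * z_k'
            L_prod = matvec(L, [a * b for a, b in zip(cols[k], cols[kp])])
            for n in range(N):
                h = 0.5 * (L_prod[n]
                           - cols[k][n] * Lz[kp][n]
                           - cols[kp][n] * Lz[k][n])
                H[n][k][kp] = H[n][kp][k] = h
    return H
```

When the rows of \(L\) sum to zero, each \(H^{(n)}\) reduces to a weighted covariance of embedding differences around sample \(n\), so it is symmetric and vanishes for a constant embedding.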

Example

These two clusters are generated as:

\[\begin{align*} x_{i} \sim \frac{1}{2}\mathcal{N}\left(0, 10\right) + \frac{1}{2}\mathcal{N}\left(100, 1\right) \end{align*}\]

Example

The UMAP embeddings lose information about the cluster density, but the difference is captured in the local metrics.

Local Isometrization

Since the metrics are known locally, the distortion can be inverted within a neighborhood of the cursor. For example, here we interactively adapt the embeddings in the Gaussian mixtures example.

Fragmented Neighborhoods

Besides RMetric, we also visualize distortions using the scatterplot of true vs. embedding neighborhood distances.

Fragmented Neighborhoods

To detect poorly preserved neighborhoods, we fit a running median to true vs. embedding distances, flag outliers above \(3\times \text{IQR}\), and mark points with many outlier links as “broken.”
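One reading of this procedure is sketched below. Several details are assumptions made for illustration: the running median uses a fixed window over links sorted by true distance, a link is flagged when its residual exceeds \(3\times \text{IQR}\) of all residuals, and a point is “broken” once it has at least `min_links` flagged links. The function name and arguments are hypothetical.

```python
import statistics
from collections import Counter

def broken_points(links, window=5, k=3.0, min_links=1):
    """links: list of (i, j, d_true, d_embed) for neighboring points i and j."""
    order = sorted(range(len(links)), key=lambda a: links[a][2])
    resid = [0.0] * len(links)
    for rank, a in enumerate(order):
        lo, hi = max(0, rank - window), min(len(order), rank + window + 1)
        # Running median of embedding distances among links with similar true distance.
        med = statistics.median([links[b][3] for b in order[lo:hi]])
        resid[a] = links[a][3] - med
    q1, _, q3 = statistics.quantiles(resid, n=4)
    threshold = k * (q3 - q1)
    counts = Counter()
    for (i, j, _, _), r in zip(links, resid):
        if r > threshold:  # embedding distance far above the local trend
            counts[i] += 1
            counts[j] += 1
    return {point for point, c in counts.items() if c >= min_links}
```

Raising `min_links` makes the flag more conservative, since a point must then participate in several poorly preserved links before it is marked.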

Examples

This is the classic Swiss Roll data, but with higher density near the endpoints.

Variable Density Swiss Roll

\(t\)-SNE (perplexity = 100) breaks the roll in the low-density region and artificially spreads the high-density area.

Fragmented Neighborhoods

Poorly Preserved Distances

PBMC Dataset

This single-cell gene expression data set was used in the data visualization tutorial from the scanpy package (Wolf, Angerer, and Theis 2018). Each point is the UMAP embedding of a cell’s high-dimensional gene expression data.

PBMC Dataset

  • Distance scales vary both across and within clusters.
  • Two T-cell sets appear farther from Monocytes than they actually are.

Hydra Cell Atlas

In this hydra cell differentiation dataset (Siebert et al. 2019; Xia, Lee, and Li 2024), \(t\)-SNE (perplexity = 80) collapses points along the dataset periphery and exaggerates between-cluster distances.

Hydra Cell Atlas

At perplexity = 500, the clusters are more reliable, but peripheral samples are in fact closer than they appear.

Hydra Cell Atlas

The local isometry visualization highlights that some “threads” are more spread out in the original data.

DensMap vs. UMAP

Both variation in ellipses and fragmented neighborhood statistics can be used to compare competing algorithms, similarly to (Xia, Lee, and Li 2024; Venna and Kaski 2006).

This example uses data from a C. elegans cell differentiation study (Packer et al. 2019).

Summary

  1. Topic alignment helps better understand the influence of \(K\) in exploratory analysis of count data.

  2. Interactivity can reveal distortion information based on the analyst’s priorities.

Papers: https://go.wisc.edu/oe3g62, https://go.wisc.edu/tify36

Packages: https://lasy.github.io/alto, https://pypi.org/project/distortions

Acknowledgments

  • Contact: ksankaran@wisc.edu
  • Lab Members: Margaret Thairu, Yuliang Peng, Langtian Ma, Cameron Jones, Jiaxin Ye, Megan Kuo, Helena Huang
  • Funding: NIGMS R01GM152744, NIAID R01AI184095, Gates INV-072185, NIH R01HG014687

Appendix

Focus-plus-Context

The focus-plus-context principle (Heer and Card 2004) states that readers should be able to zoom into patterns of interest without losing relevant context.

Focus-plus-Context: Barplots

Stacked barplots visualize sample-to-sample variation in microbiome community structure. They struggle to show finer taxonomic resolutions.

Phylobar

Applying focus-plus-context reveals rare taxa abundances while preserving overall structure.

Paths

For each vertex \(v\), select the highest-weight incoming edge: \[e^\ast(v) = \arg\max_{e:\,\text{target}(e)=v}\bigl[\tilde{w}_{\text{out}}(e)+\tilde{w}_{\text{in}}(e)\bigr].\]

Iterating from largest to smallest \(l\) yields a set of vertex-disjoint paths. The number of such paths characterizes the alignment.
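The edge selection step can be sketched as follows. Only the per-vertex argmax is shown, not the subsequent path construction; the edge tuple layout and the function name are assumptions for illustration.

```python
def best_incoming_edges(edges):
    """Map each target vertex to the source of its highest-weight incoming edge.

    edges: list of (source, target, w_out, w_in), where w_out and w_in are the
    normalized weights that appear in the argmax above.
    """
    best = {}
    for source, target, w_out, w_in in edges:
        score = w_out + w_in
        if target not in best or score > best[target][1]:
            best[target] = (source, score)
    return {target: source for target, (source, _) in best.items()}
```

For example, if topic `"c"` in one model receives edges from `"a"` and `"b"` in the previous model, the function keeps whichever parent has the larger summed weight.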

Coherence

A topic’s coherence measures its average similarity to other topics along the same path,

\[\begin{align*} c(v) = \frac{1}{|\text{Path}\left(v\right)|} \sum_{v' \in \text{Path}\left(v\right)} \min\left(\win\left(v, v'\right), \wout\left(v, v'\right) \right) \end{align*}\]

Transient topics (appearing or vanishing across \(K\)) score low, while consistently recovered topics score high.
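The coherence formula translates into a short function. In this sketch the normalized weights are passed as dictionaries keyed by topic pairs, which is an assumed data layout chosen for the example.

```python
def coherence(v, path, w_in, w_out):
    """Average of min(w_in, w_out) between topic v and the topics v' in its path.

    path: the topics v' summed over in Path(v);
    w_in, w_out: dicts mapping (v, v') to normalized weights.
    """
    total = sum(min(w_in[(v, u)], w_out[(v, u)]) for u in path)
    return total / len(path)
```

Taking the minimum means a pair contributes strongly only when both normalized weights are large, so a topic that exchanges mass in one direction only scores low.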

Refinement

Parent specificity identifies two distinct regimes,

  • High Refinement: Each topic receives the most mass from a unique parent, corresponding to a true or “compromise” topic
  • Low Refinement: Each topic receives substantial mass from several parents, each corresponding to an arbitrary split of a true topic

\[\begin{align*} r(v)=\frac{\left|V_{m}\right|}{M-m} \sum_{m^{\prime}=m+1}^{M} \sum_{v_{m^{\prime}}^{\prime} \in V_{m^{\prime}}} \wout \left(v, v_{m^{\prime}}^{\prime}\right) \win\left(v, v_{m^{\prime}}^{\prime}\right) \end{align*}\]

Mammoth

This example comes from (“Understanding UMAP — Pair-Code.github.io”; Noichl). The 3D skeleton scans were produced by the Smithsonian, and we can use nonlinear dimensionality reduction to “flatten” the skeleton into 2D.

Mammoth

This is the embedding when applying UMAP with a 50 nearest-neighbor graph and min_dist = 0.5.

Mammoth

Parts of the shoulders, head, and tail are further apart in the embedding compared to the original data. Most other distortions are points that are placed too close to one another.

Graph Laplacian

To compute \(L\), we use the estimator from (Coifman and Lafon 2006).

  1. Build the kernel matrix \(W_{kl} = \exp(-\|X_k - X_l\|^2 / h)\), where \(h\) is a bandwidth hyperparameter.

  2. Normalize both columns and rows. \[\begin{align*} D &= \text{diag}(W\mathbf{1}) \qquad \tilde{W} = D^{-1}WD^{-1} \\ \tilde{D} &= \text{diag}(\tilde{W}\mathbf{1}) \qquad L = \tilde{D}^{-1}\tilde{W} \end{align*}\] Column normalization accounts for differences in sampling density.
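The two steps above can be sketched in plain Python. The function name is hypothetical, and the code returns the doubly-normalized matrix \(\tilde{D}^{-1}\tilde{W}\) exactly as written in step 2, with no bandwidth scaling.

```python
import math

def doubly_normalized_kernel(X, h):
    """Return D~^{-1} W~ for points X (list of coordinate lists) and bandwidth h."""
    N = len(X)
    # Step 1: Gaussian kernel matrix.
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(X[k], X[l])) / h)
          for l in range(N)] for k in range(N)]
    # Step 2a: symmetric normalization D^{-1} W D^{-1} corrects for sampling density.
    d = [sum(row) for row in W]
    W_tilde = [[W[k][l] / (d[k] * d[l]) for l in range(N)] for k in range(N)]
    # Step 2b: row normalization by the new degrees.
    d_tilde = [sum(row) for row in W_tilde]
    return [[W_tilde[k][l] / d_tilde[k] for l in range(N)] for k in range(N)]
```

Each row of the result sums to one. Subtracting the identity gives rows that sum to zero, which is the form under which a product-rule estimate of \(H\) vanishes for constant embeddings; whether that shift is folded into \(L\) here is not stated, so treat it as an assumption.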

Derivation of \(G\)

First note that, \[\begin{align*} \frac{\partial\mathbf{x}}{\partial \theta} = \left(-\cos\varphi\sin\theta, \cos\varphi\cos\theta, 0\right). \end{align*}\] Therefore, \[\begin{align*} g_{11} &= \left\langle \frac{\partial\mathbf{x}}{\partial \theta}, \frac{\partial\mathbf{x}}{\partial \theta}\right\rangle_{\mathbb{R}^3} \\ &= \cos^2\varphi\sin^2\theta + \cos^2\varphi\cos^2\theta \\ &= \cos^2\varphi \end{align*}\]
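The remaining entries of \(G\) follow from the same calculation. As a worked check, using the parameterization \(\mathbf{x}\left(p\right)\) of the sphere,

\[\begin{align*} \frac{\partial\mathbf{x}}{\partial \varphi} &= \left(-\sin\varphi\cos\theta, -\sin\varphi\sin\theta, \cos\varphi\right) \\ g_{22} &= \sin^2\varphi\cos^2\theta + \sin^2\varphi\sin^2\theta + \cos^2\varphi = 1 \\ g_{12} &= \cos\varphi\sin\varphi\sin\theta\cos\theta - \cos\varphi\sin\varphi\cos\theta\sin\theta = 0, \end{align*}\]

so that \(G = \text{diag}\left(\cos^2\varphi, 1\right)\).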

Derivation of \(H\) from \(\Delta\left(fg\right)\)

By the product formula with \(z^1 = \theta\): \[\begin{align*} h^{11} = \langle \nabla \theta, \nabla \theta \rangle_{g_0} = \frac{1}{2}\left[\Delta(\theta^2) - 2\theta\Delta\theta\right] \end{align*}\]

Derivation of \(H\) from \(\Delta\left(fg\right)\)

The general formula for the Laplace-Beltrami operator is \[\Delta f = \frac{1}{\sqrt{\det G}}\sum_{i,j}\frac{\partial}{\partial z^i}\left(\sqrt{\det G}\, g^{ij}\frac{\partial f}{\partial z^j}\right).\]

Since \(\det(G) = \cos^2\varphi\) and the off-diagonal \(g^{ij}\) are zero, \[\begin{align*} \Delta f = \frac{1}{\cos^2\varphi}\frac{\partial^2 f}{\partial\theta^2} + \frac{1}{\cos\varphi}\frac{\partial}{\partial\varphi}\left(\cos\varphi\frac{\partial f}{\partial\varphi}\right) \end{align*}\]

Derivation of \(H\) from \(\Delta\left(fg\right)\)

We can plug in the choices of \(f\) that we care about, \[\begin{align*} \Delta\theta &= 0\\ \Delta(\theta^2) &= \frac{1}{\cos^2\varphi}\frac{\partial^2(\theta^2)}{\partial\theta^2} = \frac{2}{\cos^2\varphi} \end{align*}\]

and then substitute into the formula from 2 slides ago, \[\begin{align*} h^{11} = \frac{1}{2}\left[\frac{2}{\cos^2\varphi} - 2\theta \cdot 0\right] = \frac{1}{\cos^2\varphi}. \end{align*}\]
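This derivation can be verified numerically. The sketch below discretizes the Laplace-Beltrami formula for the sphere with finite differences and then applies the product rule to \(\theta\); the function names and step size are assumptions for illustration.

```python
import math

def laplace_beltrami(f, theta, phi, eps=1e-4):
    """Finite-difference Laplace-Beltrami operator on the sphere in (theta, phi)."""
    # Second derivative in theta.
    f_tt = (f(theta + eps, phi) - 2 * f(theta, phi) + f(theta - eps, phi)) / eps ** 2

    def flux(p):
        # cos(p) * df/dphi evaluated at latitude p.
        return math.cos(p) * (f(theta, p + eps / 2) - f(theta, p - eps / 2)) / eps

    div = (flux(phi + eps / 2) - flux(phi - eps / 2)) / eps
    return f_tt / math.cos(phi) ** 2 + div / math.cos(phi)

def h11(theta, phi):
    """Product-rule estimate of <grad theta, grad theta>."""
    lap_sq = laplace_beltrami(lambda t, p: t * t, theta, phi)
    lap = laplace_beltrami(lambda t, p: t, theta, phi)
    return 0.5 * (lap_sq - 2 * theta * lap)
```

Comparing `h11(theta, phi)` with `1 / math.cos(phi) ** 2` at a few latitudes reproduces the closed form up to discretization error.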

Interaction Gulfs

There are two classic challenges in interactive interfaces (Hutchins, Hollan, and Norman 1985),

  • Gulf of Execution: The gap between what you want to do and how you specify it in the system.

  • Gulf of Evaluation: The gap between what the system shows you and an understanding of what it represents.

These challenges also apply to interactive data analysis.

Specious Art

Nonlinear dimensionality reduction has become the source of widespread concern in the single-cell literature (Chari and Pachter 2023).

We introduce two new diagnostic visualizations.

alto plots for choice of \(K\) in topic models.

RMetric plots for nonlinear embeddings.

Both are grounded in the data visualization principles that we review next.

Example: Antibiotics Time Course

To interpret topics, look for representative taxa. These species have higher probabilities in one topic compared to the others.

Model with background variation

Topic alignment identifies departures from the assumed model. Consider the generative mechanism,

\[\begin{align*} x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\ \nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\ \gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\ \beta_{k} &\sim \Dir\left(\lambda_{\beta}\right), \end{align*}\]

where \(\*B\) stacks the \(\beta_k\) columnwise, so that \(\*B\gamma_{i}\) lies in the simplex.

Result

The alignment structure is sensitive to changes in \(\alpha\) and fragments when structure is not present.

Diagnostics

This structure is consistent across simulation runs, and the diagnostics quantify topic deterioration.

Dual Pushforward Metric

Alternatively, the gradients \(\nabla z^i\) of the embedding dimensions also reflect distortion.

In contrast, the gradients \(\nabla z^{\varphi}\) don’t depend on \(\varphi\).

Comparison with LOO-map

The stability-based algorithm (Liu, Ma, and Zhong 2025) gives a similar interpretation. But the visual encoding is more subtle, and the leave-one-out approach is time-consuming even with approximations.

Distortions in \(t\)-SNE and UMAP

They can make high-dimensional random walks look artificially smooth…

Example from (Wattenberg, Viégas, and Johnson 2016).

References

Barnett, David J. M., Ilja C. W. Arts, and John Penders. 2021. “microViz: An R Package for Microbiome Data Visualization and Statistics.” Journal of Open Source Software 6 (63): 3201. https://doi.org/10.21105/joss.03201.
Bolyen, Evan, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian C. Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, et al. 2019. “Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2.” Nature Biotechnology 37 (8): 852–57. https://doi.org/10.1038/s41587-019-0209-9.
Chari, Tara, and Lior Pachter. 2023. “The Specious Art of Single-Cell Genomics.” Edited by Jason A. Papin. PLOS Computational Biology 19 (8): e1011288. https://doi.org/10.1371/journal.pcbi.1011288.
Coifman, Ronald R., and Stéphane Lafon. 2006. “Diffusion Maps.” Applied and Computational Harmonic Analysis 21 (1): 5–30. https://doi.org/10.1016/j.acha.2006.04.006.
Fettweis, Jennifer M., Myrna G. Serrano, J. Paul Brooks, David J. Edwards, Philippe H. Girerd, Hardik I. Parikh, Bernice Huang, et al. 2019. “The Vaginal Microbiome and Preterm Birth.” Nature Medicine 25 (6): 1012–21. https://doi.org/10.1038/s41591-019-0450-2.
Gosmann, Christina, Melis N. Anahtar, Scott A. Handley, Mara Farcasanu, Galeb Abu-Ali, Brittany A. Bowman, Nikita Padavattan, et al. 2017. “Lactobacillus-Deficient Cervicovaginal Bacterial Communities Are Associated with Increased HIV Acquisition in Young South African Women.” Immunity 46 (1): 29–37. https://doi.org/10.1016/j.immuni.2016.12.013.
Gudnadottir, Unnur, Justine W. Debelius, Juan Du, Luisa W. Hugerth, Hanna Danielsson, Ina Schuppe-Koistinen, Emma Fransson, and Nele Brusselaers. 2022. “The Vaginal Microbiome and the Risk of Preterm Birth: A Systematic Review and Network Meta-Analysis.” Scientific Reports 12 (1). https://doi.org/10.1038/s41598-022-12007-9.
Heer, Jeffrey, and Stuart K. Card. 2004. “DOITrees Revisited: Scalable, Space-Constrained Visualization of Hierarchical Data.” In Proceedings of the Working Conference on Advanced Visual Interfaces, 421–24. AVI ’04. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/989863.989941.
Hein, Matthias, Jean-Yves Audibert, and Ulrike von Luxburg. 2005. “From Graphs to Manifolds – Weak and Strong Pointwise Consistency of Graph Laplacians.” In Learning Theory, 470–85. Springer Berlin Heidelberg. https://doi.org/10.1007/11503415_32.
Hutchins, Edwin L., James D. Hollan, and Donald A. Norman. 1985. “Direct Manipulation Interfaces.” Human–Computer Interaction 1 (4): 311–38. https://doi.org/10.1207/s15327051hci0104_2.
Irizarry, Rafael. 2024. “Simply Statistics: Biologists, Stop Putting UMAP Plots in Your Papers — Simplystatistics.org.” https://simplystatistics.org/posts/2024-12-23-biologists-stop-including-umap-plots-in-your-papers/.
Kobak, Dmitry, and George C. Linderman. 2021. “Initialization Is Critical for Preserving Global Data Structure in Both t-SNE and UMAP.” Nature Biotechnology 39 (2): 156–57. https://doi.org/10.1038/s41587-020-00809-z.
Kozlov, Max. 2024. “‘All of Us’ Genetics Chart Stirs Unease over Controversial Depiction of Race.” Nature, February. https://doi.org/10.1038/d41586-024-00568-w.
Kuo, Megan, Kim-Anh Lê Cao, Saritha Kodikara, Jiadong Mao, and Kris Sankaran. 2025. “Phylobar: An R Package for Multiresolution Compositional Barplots in Omics Studies,” November. https://doi.org/10.1101/2025.11.05.686662.
Laskowski, Piotr H. 1989. “The Traditional and Modern Look at Tissot’s Indicatrix.” The American Cartographer 16 (2): 123–33.
Lawson, Daniel J., Lucy van Dorp, and Daniel Falush. 2018. “A Tutorial on How Not to over-Interpret STRUCTURE and ADMIXTURE Bar Plots.” Nature Communications 9 (1). https://doi.org/10.1038/s41467-018-05257-7.
Liu, Zhexuan, Rong Ma, and Yiqiao Zhong. 2025. “Assessing and Improving Reliability of Neighbor Embedding Methods: A Map-Continuity Perspective.” Nature Communications 16 (1). https://doi.org/10.1038/s41467-025-60434-9.
McQueen, James, Marina Meilă, Jacob VanderPlas, and Zhongyue Zhang. 2016. “Megaman: Scalable Manifold Learning in Python.” Journal of Machine Learning Research 17 (148): 1–5.
Narayan, Ashwin, Bonnie Berger, and Hyunghoon Cho. 2021. “Assessing Single-Cell Transcriptomic Variability Through Density-Preserving Data Visualization.” Nature Biotechnology 39 (6): 765–74.
Noichl, Max. “Max Noichl | Flattening Mammoths — Maxnoichl.eu.” https://www.maxnoichl.eu/projects/mammoth/.
Novembre, John. 2016. “Pritchard, Stephens, and Donnelly on Population Structure.” Genetics 204 (2): 391–93. https://doi.org/10.1534/genetics.116.195164.
Packer, Jonathan S., Qin Zhu, Chau Huynh, Priya Sivaramakrishnan, Elicia Preston, Hannah Dueck, Derek Stefanik, et al. 2019. “A Lineage-Resolved Molecular Atlas of C. elegans Embryogenesis at Single-Cell Resolution.” Science 365 (6459). https://doi.org/10.1126/science.aax1971.
Perrault-Joncas, Dominique, and Marina Meila. 2006. “Metric Learning of Manifolds.” Semisupervised Learn 1: 293–306.
Peyré, Gabriel, and Marco Cuturi. 2018. “Computational Optimal Transport.” https://doi.org/10.48550/ARXIV.1803.00567.
Pritchard, Jonathan K, Matthew Stephens, and Peter Donnelly. 2000. “Inference of Population Structure Using Multilocus Genotype Data.” Genetics 155 (2): 945–59. https://doi.org/10.1093/genetics/155.2.945.
Ravel, Jacques, Pawel Gajer, Zaid Abdo, G. Maria Schneider, Sara S. K. Koenig, Stacey L. McCulle, Shara Karlebach, et al. 2010. “Vaginal Microbiome of Reproductive-Age Women.” Proceedings of the National Academy of Sciences 108 (supplement_1): 4680–87. https://doi.org/10.1073/pnas.1002611107.
Sankaran, Kris, and Susan P Holmes. 2018. “Latent Variable Modeling for the Microbiome.” Biostatistics 20 (4): 599–614. https://doi.org/10.1093/biostatistics/kxy018.
Siebert, Stefan, Jeffrey A. Farrell, Jack F. Cazet, Yashodara Abeykoon, Abby S. Primack, Christine E. Schnitzler, and Celina E. Juliano. 2019. “Stem Cell Differentiation Trajectories in Hydra Resolved at Single-Cell Resolution.” Science 365 (6451). https://doi.org/10.1126/science.aav9314.
Symul, Laura, Pratheepa Jeganathan, Elizabeth K. Costello, Michael France, Seth M. Bloom, Douglas S. Kwon, Jacques Ravel, David A. Relman, and Susan Holmes. 2023. “Sub-Communities of the Vaginal Microbiota in Pregnant and Non-Pregnant Women.” Proceedings of the Royal Society B: Biological Sciences 290 (2011). https://doi.org/10.1098/rspb.2023.1461.
“Understanding UMAP — Pair-Code.github.io.” https://pair-code.github.io/understanding-umap/.
Venna, Jarkko, and Samuel Kaski. 2006. “Local Multidimensional Scaling.” Neural Networks 19 (6–7): 889–99. https://doi.org/10.1016/j.neunet.2006.05.014.
Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. “Evaluation Methods for Topic Models.” International Conference on Machine Learning. https://api.semanticscholar.org/CorpusID:10910725.
Wattenberg, Martin, Fernanda Viégas, and Ian Johnson. 2016. “How to Use t-SNE Effectively.” Distill. https://doi.org/10.23915/distill.00002.
Wolf, F. Alexander, Philipp Angerer, and Fabian J. Theis. 2018. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biology 19 (1). https://doi.org/10.1186/s13059-017-1382-0.
Xia, Lucy, Christy Lee, and Jingyi Jessica Li. 2024. “Statistical Method scDEED for Detecting Dubious 2D Single-Cell Embeddings and Optimizing t-SNE and UMAP Hyperparameters.” Nature Communications 15 (1): 1753.