Modern omics data have many features, and dimensionality reduction methods help create overview visualizations.
They can summarize microbiome community structure
\(\def\Dir{\text{Dir}}\) \(\def\Mult{\text{Mult}}\) \(\def\*#1{\mathbf{#1}}\) \(\def\m#1{\boldsymbol{#1}}\) \(\def\Unif{\text{Unif}}\) \(\def\win{\tilde{w}_{\text{in}}}\) \(\def\reals{\mathbb{R}}\) \(\newcommand{\wout}{\tilde w_{\text{out}}}\)
Modern omics data have many features, and dimensionality reduction methods help create overview visualizations.
They are also used in cell atlas construction and trajectory inference.
These methods need to be used with caution:
Despite being widely used, there are few diagnostics for these dimensionality reduction methods.
Fukuyama, J., Sankaran, K., & Symul, L. (2023). Multiscale analysis of count data through topic alignment. Biostatistics (Oxford, England), 24(4), 1045–1065. doi:10.1093/biostatistics/kxac018
Topic models suppose that samples \(x_i \in \mathbb{R}^{D}\) are drawn independently: \[\begin{align*} x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\ \gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right) \end{align*}\] where the columns \(\beta_{k}\) of \(\*B \in \Delta^{K}_{D}\) lie in the \(D\)-dimensional simplex and are drawn independently from, \[\begin{align*} \beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right). \end{align*}\]
Vertically stack the \(N\) \(\gamma_i\)’s into \(\Gamma \in \Delta^{N}_{K}\).
If we use topic models, Topic 2 increases during the antibiotic interventions, especially the first (Sankaran and Holmes 2018).
However, we stress that care should be taken in the interpretation of the inferred value of \(K\). To begin with, due to the very high dimensionality of the parameter space, we found it difficult to obtain reliable estimates of \(P\left(X \vert K\right)\)… There are also biological reasons to be careful interpreting \(K\).
– From (Pritchard, Stephens, and Donnelly 2000).
In practice, models are fit across many \(K\) and their goodness-of-fits are compared (Novembre 2016; Wallach et al. 2009; Lawson, Dorp, and Falush 2018).
alto: Main IdeaWe fit models of varying complexity \(K\) and build a compact representation of the ensemble
Columns are models and rectangles are topics.
An alignment is a graph where nodes \(V\) represent topics \(\beta\left(v\right) \in \Delta^D\) and memberships \(\Gamma\left(v\right) = \left(\gamma_{ik}^{m}\right)_{i=1}^{n} \in \reals^{N}\) across models.
Let \(V_p\) and \(V_q\) be two subsets of topics within the graph.
The weights \(W\) can be learned using optimal transport, \[\begin{align*} &\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right> \end{align*}\] \[\begin{align*} \mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}. \end{align*}\]
Below is the alignment for data from a topic model. Can you guess \(K\)?
The diagnostics suggest that the true \(K\) is 5.
(Ravel et al. 2010) used clustering to identify 5 Community State Types in the vaginal microbiome.
Four healthy CSTs are dominated by Lactobacillus variants.
A fifth CST is more diverse and is associated with preterm birth (Fettweis et al. 2019; Gudnadottir et al. 2022) and HIV transmission (Gosmann et al. 2017).

(Ravel et al. 2010) grouped samples (columns) into CSTs.
The follow-up study (Symul et al. 2023) had more samples than (Ravel et al. 2010) and so could identify additional structure lying behind CSTs.

Coherence is not a function of \(K\) alone.

Topic alignment is implemented in the R package alto.
All the simulations discussed today are vignettes in the package.
Sankaran, K., Zhang, S., Chenab, & Meilă, M. (2025). Interactive Visualization of Metric Distortion in Nonlinear Data Embeddings using the distortions Package. doi:10.1101/2025.08.21.671523
Both \(t\)-SNE and UMAP introduce distortions. For example, they may not preserve density within different regions of the plot.
Example from (Narayan, Berger, and Cho 2021).
They can also fail to preserve the topology of the underlying data…
Example from (Kobak and Linderman 2021).
These distortions are not mere technical curiosities – they significantly impact scientific interpretation (Liu, Ma, and Zhong 2025; Kobak and Linderman 2021). For example, they create misleading differences between cell types that are actually similar.
Example from (Xia, Lee, and Li 2024).

See also (Kozlov 2024; Irizarry 2024).
Rather than abandoning nonlinear dimensionality reduction, we augment the embeddings to characterize distortion.


This is a high-dimensional version of Tissot’s indicatrix from cartography (Laskowski 1989).
The RMetric algorithm (Perrault-Joncas and Meila 2006; McQueen et al. 2016) quantifies distortion geometrically. To motivate the algorithm, consider the distortion induced by mapping the sphere into latitude/longitude coordinates.
Parameterize points on \(\mathcal{M}\) using spherical coordinates:
\[\begin{align*} \mathbf{x}\left(p\right) = \left(\cos\varphi\cos\theta, \cos\varphi \sin\theta, \sin\varphi\right) \end{align*}\]
The associated (latitude, longitude) embedding is
\[\begin{align*} \mathbf{z}\left(p\right) = \left(\theta\left(p\right), \varphi\left(p\right)\right). \end{align*}\]
What does a small step in the embedding space correspond to in \(\mathcal{M}\)? The pushforward metric answers this,
\[\begin{align*} g_{ij} = \left\langle \frac{\partial\mathbf{x}}{\partial z^i}, \frac{\partial\mathbf{x}}{\partial z^j}\right\rangle_{\mathbb{R}^3} \end{align*}\]
This varies across \(p \in \mathcal{M}\) but we suppress it from the notation.
In the sphere example, these derivatives can be directly computed and to obtain,
\[\begin{align*} G = \begin{pmatrix} \cos^2\varphi & 0 \\ 0 & 1 \end{pmatrix} \end{align*}\] Near the equator (\(\varphi \approx 0\)), a small step in \(\theta\) covers more distance than near the north pole (\(\varphi \approx \frac{\pi}{2}\)).
Alternatively, the gradients \(\nabla z^i\) of the embedding dimensions also reflect distortion.
Since the level sets of \(z^\theta\) become more compressed near the poles, the gradients \(\nabla z^\theta\) become larger there.
This gradient information can be stored in the matrix \(H\) with elements,
\[\begin{align*} h^{ij} = \langle \nabla z^i, \nabla z^j \rangle_{g_{0}} \end{align*}\] where \(g_{0}\) is the metric on \(\mathcal{M}\) inherited from the ambient space.
\(H\) is computable from data while \(G\) requires an explicit manifold parameterization.
In our running example,
\[\begin{align*} H = \begin{pmatrix} 1/\cos^2\varphi & 0 \\ 0 & 1 \end{pmatrix} \end{align*}\]
We can see that \(H = G^{-1}\) and that is actually true more generally.
For any \(f\) and \(g\), the Laplacian \(\Delta\) satisfies, \[\begin{align*} \Delta(fg) = f\,\Delta g + g\,\Delta f + 2\langle \nabla f, \nabla g \rangle_{g_{0}} \end{align*}\]
Setting \(f = z^i\), \(g = z^j\) and rearranging, \[\begin{align*} h^{ij} = \langle \nabla z^i, \nabla z^j \rangle_{g_0} = \frac{1}{2}\left[\Delta(z^i z^j) - z^i\Delta z^j - z^j\Delta z^i\right] \end{align*}\]
Since there are methods for estimating \(\Delta\) from data (Hein, Audibert, and Luxburg 2005; Coifman and Lafon 2006), we also have a practical method for approximating local distortions \(H\)!
Let \(z_{k} \in \mathbb{R}^{N}\) be the \(k^{th}\) embedding dimension. Let \(L\) be the doubly-normalized graph Laplacian (Coifman and Lafon 2006). Compute
\[\begin{align*} H_{kk'}^{(\cdot)} := \frac{1}{2}\left[L\left(z_{k} \circ z_{k'}\right) - z_{k} \circ \left(L z_{k'}\right) - z_{k'} \circ \left(L z_{k}\right) \right] \in \mathbb{R}^{N} \end{align*}\]
The embedding distortion for sample \(n\) is given by \(H^{(n)} \in \mathbb{R}^{K \times K}\)
These two clusters are generated as:
\[\begin{align*} x_{i} \sim \frac{1}{2}\mathcal{N}\left(0, 10\right) + \frac{1}{2}\mathcal{N}\left(100, 1\right) \end{align*}\]
The UMAP embeddings lose information about the cluster density, but the difference is captured in the local metrics.
Since the metrics are known locally, the distortion can be inverted within a neighborhood of the cursor. For example, here we interactively adapt the embeddings in the Gaussian mixtures example.
Besides RMetric, we also visualize distortions using the scatterplot of true vs. embedding neighborhood distances.

To detect poorly preserved neighborhoods, we fit a running median to true vs. embedding distances, flag outliers above \(3\times \text{IQR}\), and mark points with many outlier links as “broken.”
This is the classic Swiss Roll data, but with higher density near the endpoints.
\(t\)-SNE (perplexity = 100) breaks the roll in the low-density region and artificially spreads the high density area.
This single-cell gene expression data set was used in the data visualization tutorial from the scanpy package (Wolf, Angerer, and Theis 2018). Each point is the UMAP embedding of a cell’s high-dimensional gene expression data.
In this hydra cell differentiation dataset (Siebert et al. 2019; Xia, Lee, and Li 2024), \(t\)-SNE (perplexity = 80) collapses points along the dataset periphery and exaggerates between-cluster distances

At perplexity = 500, the clusters are more reliable, but peripheral samples are in fact closer than they appear

The local isometry visualization highlights some “threads” are more spread in the original data.

Both variation in ellipses and fragmented neighborhood statistics can be used to compare competing algorithms, similarly to (Xia, Lee, and Li 2024; Venna and Kaski 2006).
This example uses data from a C. elegans cell differentiation study (Packer et al. 2019).
Both variation in ellipses and fragmented neighborhood statistics can be used to compare competing algorithms, similarly to (Xia, Lee, and Li 2024; Venna and Kaski 2006).
This example uses data from a C. elegans cell differentiation study (Packer et al. 2019).
Topic alignment helps better understand the influence of \(K\) in exploratory analysis of count data.
Interactivity can reveal distortion information based on the analyst’s priorities.
Papers: https://go.wisc.edu/oe3g62, https://go.wisc.edu/tify36
Packages: https://lasy.github.io/alto, https://pypi.org/project/distortions
The focus-plus-context principle (Heer and Card 2004) states that readers should be able to zoom into patterns of interest without losing relevant context.
Stacked barplots visualize sample-to-sample variation in microbiome community structure. They struggle to show finer taxonomic resolutions.
Figure from the microbiomeViz documentation (Barnett, Arts, and Penders 2021).
Stacked barplots visualize sample-to-sample variation in microbiome community structure. They struggle to show finer taxonomic resolutions.
Figure from the Qiime2 View documentation (Bolyen et al. 2019).
Applying focus-plus-context reveals rare taxa abundances while preserving overall structure
Figure from the phylobar documentation (Kuo et al. 2025).
For each vertex \(v\), select the highest-weight incoming edge: \[e^\ast(v) = \arg\max_{e:\,\text{target}(e)=v}\bigl[\tilde{w}_{\text{out}}(e)+\tilde{w}_{\text{in}}(e)\bigr].\]
Iterating from largest to smallest \(l\) yields a set of vertex-disjoint paths, the number of such paths characterizes the alignment.
For each vertex \(v\), select the highest-weight incoming edge: \[e^\ast(v) = \arg\max_{e:\,\text{target}(e)=v}\bigl[\tilde{w}_{\text{out}}(e)+\tilde{w}_{\text{in}}(e)\bigr].\]
Iterating from largest to smallest \(l\) yields a set of vertex-disjoint paths, the number of such paths characterizes the alignment.
For each vertex \(v\), select the highest-weight incoming edge: \[e^\ast(v) = \arg\max_{e:\,\text{target}(e)=v}\bigl[\tilde{w}_{\text{out}}(e)+\tilde{w}_{\text{in}}(e)\bigr].\]
Iterating from largest to smallest \(l\) yields a set of vertex-disjoint paths, the number of such paths characterizes the alignment.
For each vertex \(v\), select the highest-weight incoming edge: \[e^\ast(v) = \arg\max_{e:\,\text{target}(e)=v}\bigl[\tilde{w}_{\text{out}}(e)+\tilde{w}_{\text{in}}(e)\bigr].\]
Iterating from largest to smallest \(l\) yields a set of vertex-disjoint paths, the number of such paths characterizes the alignment.
For each vertex \(v\), select the highest-weight incoming edge: \[e^\ast(v) = \arg\max_{e:\,\text{target}(e)=v}\bigl[\tilde{w}_{\text{out}}(e)+\tilde{w}_{\text{in}}(e)\bigr].\]
Iterating from largest to smallest \(l\) yields a set of vertex-disjoint paths, the number of such paths characterizes the alignment.
A topic’s coherence measures its average similarity to other topics along the same path,
\[\begin{align*} c(v) = \frac{1}{|\text{Path}\left(v\right)|} \sum_{v' \in \text{Path}\left(v\right)} \min\left(\win\left(v, v'\right), \wout\left(v, v'\right) \right) \end{align*}\]
Transient topics (appearing or vanishing across \(K\)) score low, while consistently recovered topics score high.
Parent specificity identifies two distinct regimes,
Parent specificity identifies two distinct regimes,
\[\begin{align*} r(v)=\frac{\left|V_{m}\right|}{M-m} \sum_{m^{\prime}=m+1}^{M} \sum_{v_{m^{\prime}}^{\prime} \in V_{m^{\prime}}} \wout \left(v, v_{m^{\prime}}^{\prime}\right) \win\left(v, v_{m^{\prime}}^{\prime}\right) \end{align*}\]
This example comes from (“Understanding UMAP — Pair-Code.github.io”; Noichl). The 3D skeleton scans were produced by the Smithsonian, and we can use nonlinear dimensionality reduction to “flatten” the skeleton into 2D.
This is the embedding when applying UMAP with a 50 nearest-neighbor graph and min_dist = 0.5.
Parts of the shoulders, head, and tail are further apart in the embedding compared to the original data. Most other distortions are points that are placed too close to one another.
Parts of the shoulders, head, and tail are further apart in the embedding compared to the original data. Most other distortions are points that are placed too close to one another.
To compute \(L\), we use the estimator from (Coifman and Lafon 2006).
Build the kernel matrix \(W_{kl} = \exp(-\|X_k - X_l\|^2 / h)\), where \(h\) is a bandwidth hyperparameter.
Normalize both columns and rows. \[\begin{align*} D &= \text{diag}(W\mathbf{1}) \qquad \tilde{W} = D^{-1}WD^{-1} \\ \tilde{D} &= \text{diag}(\tilde{W}\mathbf{1}) \qquad L = \tilde{D}^{-1}\tilde{W} \end{align*}\] Column normalization accounts for differences in sampling density.
First note that, \[\begin{align*} \frac{\partial\mathbf{x}}{\partial \theta} = \left(-\cos\varphi\sin\theta, \cos\varphi\cos\theta, 0\right). \end{align*}\] Therefore, \[\begin{align*} g_{11} &= \left\langle \frac{\partial\mathbf{x}}{\partial \theta}, \frac{\partial\mathbf{x}}{\partial \theta}\right\rangle_{\mathbb{R}^3} \\ &= \cos^2\varphi\sin^2\theta + \cos^2\varphi\cos^2\theta \\ &= \cos^2\varphi \end{align*}\]
By the product formula with \(z^1 = \theta\): \[\begin{align*} h^{11} = \langle \nabla \theta, \nabla \theta \rangle_{g_0} = \frac{1}{2}\left[\Delta(\theta^2) - 2\theta\Delta\theta\right] \end{align*}\]
The general formula for the Laplace-Beltrami operator is \[\Delta f = \frac{1}{\sqrt{\det G}}\sum_{i,j}\frac{\partial}{\partial z^i}\left(\sqrt{\det G}\, g^{ij}\frac{\partial f}{\partial z^j}\right).\]
Since \(\det(G) = \cos^2\varphi\) and the off-diagonal \(g^{ij}\) are zero, \[\begin{align*} \Delta f = \frac{1}{\cos^2\varphi}\frac{\partial^2 f}{\partial\theta^2} + \frac{1}{\cos\varphi}\frac{\partial}{\partial\varphi}\left(\cos\varphi\frac{\partial f}{\partial\varphi}\right) \end{align*}\]
We can plug in the choices of \(f\) that we care about, \[\begin{align*} \Delta\theta &= 0\\ \Delta(\theta^2) &= \frac{1}{\cos^2\varphi}\frac{\partial^2(\theta^2)}{\partial\theta^2} = \frac{2}{\cos^2\varphi} \end{align*}\]
and then substitute into the formula from 2 slides ago, \[\begin{align*} h^{11} = \frac{1}{2}\left[\frac{2}{\cos^2\varphi} - 2\theta \cdot 0\right] = \frac{1}{\cos^2\varphi}. \end{align*}\]
There are two classic challenges in interactive interfaces (Hutchins, Hollan, and Norman 1985),
These challenges also apply to interactive data analysis.
There are two classic challenges in interactive interfaces (Hutchins, Hollan, and Norman 1985),
These challenges also apply to interactive data analysis.
Nonlinear dimensionality reduction has become the source of widespread concern in the single-cell literature (Chari and Pachter 2023).
We introduce two new diagnostic visualizations.
alto plots for choice of \(K\) in topic models . 
RMetric plots for nonlinear embeddings. 
Both are grounded in the data visualization principles that we review next.
To interpret topics, look for representative taxa. These species have higher probabilities in one topic compared to the others.
Topic alignment identifies departures from the assumed model. Consider the generative mechanism,
\[\begin{align*} x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\ \nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\ \gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\ \beta_{k} &\sim \Dir\left(\lambda_{\beta}\right), \end{align*}\]
where \(\*B\) stacks the \(\beta_k\) rowwise.
The alignment structure is sensitive to changes in \(\alpha\) and fragments when structure is not present.


This structure is consistent across simulation runs, and the diagnostics quantify topic deterioration.

Alternatively, the gradients \(\nabla z^i\) of the embedding dimensions also reflect distortion.
In contrast, the gradients \(\nabla z^{\varphi}\) don’t depend on \(\varphi\).
The stability-based algorithm (Liu, Ma, and Zhong 2025) gives a similar interpretation. But the visual encoding is more subtle, and the leave-one-out approach is time consuming even with approximations.
They can make high-dimensional random walks look artificially smooth…
Example from (Wattenberg, Viégas, and Johnson 2016).