Function Reference

API Reference

distortions.geometry

Geometry

Bases: object

The Geometry class stores the data, distance, affinity and laplacian matrices used by the various embedding methods and is the primary object passed to embedding functions.

The Geometry class contains functions to compute the aforementioned matrices and allows for re-computation whenever necessary.

Parameters:

Name Type Description Default
adjacency_method string {'auto', 'brute', 'pyflann', 'cyflann'}

method for computing pairwise radius neighbors graph.

'auto'
adjacency_kwds dict

Dictionary of keyword arguments for the adjacency matrix computation. See the distance.py documentation for the arguments each method accepts. Keyword arguments passed to compute_adjacency_matrix are merged with this dictionary for that call only; the stored dictionary is not modified.

None
affinity_method string {'auto', 'gaussian'}

method of computing affinity matrix

'auto'
affinity_kwds dict

Dictionary of keyword arguments for the affinity matrix computation. See the affinity.py documentation for the arguments each method accepts. Keyword arguments passed to compute_affinity_matrix are merged with this dictionary for that call only; the stored dictionary is not modified.

None
laplacian_method string

Type of Laplacian to compute. Options are {'symmetricnormalized', 'geometric', 'renormalized', 'unnormalized', 'randomwalk'}; see laplacian.py for more information.

'auto'
laplacian_kwds dict

Dictionary of keyword arguments for the Laplacian matrix computation. See the laplacian.py documentation for the arguments each method accepts. Keyword arguments passed to compute_laplacian_matrix are merged with this dictionary for that call only; the stored dictionary is not modified.

None
**kwargs

Additional arguments are parsed and used to override values in the above dictionaries. For example:
- affinity_radius overrides affinity_kwds['radius']
- adjacency_n_neighbors overrides adjacency_kwds['n_neighbors']

{}
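
The prefix routing described for **kwargs can be sketched in isolation. The helper below is hypothetical (route_prefixed_kwargs is not part of the library). It only mirrors the idea of splitting a key such as affinity_radius at its first underscore, and it additionally rejects keys with an empty suffix, which is an assumption of this sketch.

```python
def route_prefixed_kwargs(kwargs, groups=("adjacency", "affinity", "laplacian")):
    """Split keys like 'affinity_radius' into {'affinity': {'radius': ...}}.

    Hypothetical helper for illustration; not the library's API.
    """
    routed = {name: {} for name in groups}
    for key, value in kwargs.items():
        # 'adjacency_n_neighbors' -> prefix 'adjacency', rest 'n_neighbors'
        prefix, _, rest = key.partition("_")
        if prefix not in routed or not rest:
            raise ValueError("key `{0}` not valid".format(key))
        routed[prefix][rest] = value
    return routed


routed = route_prefixed_kwargs({"affinity_radius": 0.5,
                                "adjacency_n_neighbors": 10})
print(routed)
```

Only the part before the first underscore selects the target dictionary, so multi-word option names such as n_neighbors survive intact.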
Source code in distortions/geometry/geometry.py
class Geometry(object):
    """
    The Geometry class stores the data, distance, affinity and laplacian
    matrices used by the various embedding methods and is the primary
    object passed to embedding functions.

    The Geometry class contains functions to compute the aforementioned
    matrices and allows for re-computation whenever necessary.

    Parameters
    ----------
    adjacency_method : string {'auto', 'brute', 'pyflann', 'cyflann'}
        method for computing pairwise radius neighbors graph.
    adjacency_kwds : dict
        dictionary containing keyword arguments for adjacency matrix.
        see distance.py documentation for arguments for each method.
        Keyword arguments passed to compute_adjacency_matrix are merged
        with this dictionary for that call only; the stored dictionary
        is not modified.
    affinity_method : string {'auto', 'gaussian'}
        method of computing affinity matrix
    affinity_kwds : dict
        dictionary containing keyword arguments for affinity matrix.
        see affinity.py documentation for arguments for each method.
        Keyword arguments passed to compute_affinity_matrix are merged
        with this dictionary for that call only; the stored dictionary
        is not modified.
    laplacian_method : string
        type of laplacian to be computed. Possibilities are
        {'symmetricnormalized', 'geometric', 'renormalized',
        'unnormalized', 'randomwalk'}; see laplacian.py for more information.
    laplacian_kwds : dict
        dictionary containing keyword arguments for Laplacian matrix.
        see laplacian.py documentation for arguments for each method.
        Keyword arguments passed to compute_laplacian_matrix are merged
        with this dictionary for that call only; the stored dictionary
        is not modified.
    **kwargs :
        additional arguments will be parsed and used to override values in
        the above dictionaries. For example:
        - `affinity_radius` will override `affinity_kwds['radius']`
        - `adjacency_n_neighbors` will override `adjacency_kwds['n_neighbors']`
        etc.
    """
    def __init__(self, adjacency_method='auto', adjacency_kwds=None,
                 affinity_method='auto', affinity_kwds=None,
                 laplacian_method='auto',laplacian_kwds=None, **kwargs):
        self.adjacency_method = adjacency_method
        self.adjacency_kwds = dict(**(adjacency_kwds or {}))
        self.affinity_method = affinity_method
        self.affinity_kwds = dict(**(affinity_kwds or {}))
        self.laplacian_method = laplacian_method
        self.laplacian_kwds = dict(**(laplacian_kwds or {}))

        # map extra keywords: e.g. affinity_radius -> affinity_kwds['radius']
        dicts = dict(adjacency=self.adjacency_kwds,
                     affinity=self.affinity_kwds,
                     laplacian=self.laplacian_kwds)
        for key, val in kwargs.items():
            keysplit = key.split('_')
            if keysplit[0] not in dicts:
                raise ValueError('key `{0}` not valid'.format(key))
            dicts[keysplit[0]]['_'.join(keysplit[1:])] = val

        self.X = None
        self.adjacency_matrix = None
        self.affinity_matrix = None
        self.laplacian_matrix = None
        self.laplacian_symmetric = None
        self.laplacian_weights = None

    def set_radius(self, radius, override=True, X=None, n_components=2):
        """Set the radius for the adjacency and affinity computation

        By default, this will override keyword arguments provided on
        initialization.

        Parameters
        ----------
        radius : float
            radius to set for adjacency and affinity.
        override : bool (default: True)
            if False, then only set radius if not already defined in
            `adjacency_kwds` and `affinity_kwds`.
        X : ndarray or sparse (optional)
            if provided, estimate a suitable radius from this data.
        n_components : int (default=2)
            the number of components to use when estimating the radius
        """
        if radius < 0:
            raise ValueError("radius must be non-negative")

        if override or ('radius' not in self.adjacency_kwds and
                        'n_neighbors' not in self.adjacency_kwds):
            self.adjacency_kwds['radius'] = radius

        if override or ('radius' not in self.affinity_kwds):
            self.affinity_kwds['radius'] = radius

    def set_matrix(self, X, input_type):
        """
        Set the data matrix given the input type.

        Parameters
        ----------
        X : array-like
            Input matrix to set.
        input_type : str
            Type of matrix to set. Options: {'data', 'adjacency', 'affinity'}
        """
        if input_type == 'data':
            self.set_data_matrix(X)
        elif input_type == 'adjacency':
            self.set_adjacency_matrix(X)
        elif input_type == 'affinity':
            self.set_affinity_matrix(X)
        else:
            raise ValueError("Unrecognized input_type: {0}".format(input_type))


    def compute_adjacency_matrix(self, copy=False, **kwargs):
        """
        This function will compute the adjacency matrix.
        In order to acquire the existing adjacency matrix use
        self.adjacency_matrix as compute_adjacency_matrix() will re-compute
        the adjacency matrix.

        Parameters
        ----------
        copy : boolean, whether to return a copied version of the adjacency matrix
        **kwargs : see distance.py documentation for arguments for each method.

        Returns
        -------
        self.adjacency_matrix : sparse matrix (N_obs, N_obs)
            Non explicit 0.0 values should be considered not connected.
        """
        if self.X is None:
            raise ValueError(distance_error_msg)

        kwds = self.adjacency_kwds.copy()
        kwds.update(kwargs)
        self.adjacency_matrix = compute_adjacency_matrix(self.X,
                                                         self.adjacency_method,
                                                         **kwds)
        if copy:
            return self.adjacency_matrix.copy()
        else:
            return self.adjacency_matrix

    def compute_affinity_matrix(self, copy=False, **kwargs):
        """
        This function will compute the affinity matrix. In order to
        acquire the existing affinity matrix use self.affinity_matrix as
        compute_affinity_matrix() will re-compute the affinity matrix.

        Parameters
        ----------
        copy : boolean
            whether to return a copied version of the affinity matrix
        **kwargs :
            see affinity.py documentation for arguments for each method.

        Returns
        -------
        self.affinity_matrix : sparse matrix (N_obs, N_obs)
            contains the pairwise affinity values using the Gaussian kernel
            and bandwidth equal to the affinity_radius
        """
        if self.adjacency_matrix is None:
            self.compute_adjacency_matrix()

        kwds = self.affinity_kwds.copy()
        kwds.update(kwargs)
        self.affinity_matrix = compute_affinity_matrix(self.adjacency_matrix,
                                                       self.affinity_method,
                                                       **kwds)
        if copy:
            return self.affinity_matrix.copy()
        else:
            return self.affinity_matrix

    def compute_laplacian_matrix(self, copy=True, return_lapsym=False, **kwargs):
        """
        Note: this function will compute the laplacian matrix. In order to acquire
            the existing laplacian matrix use self.laplacian_matrix as
            compute_laplacian_matrix() will re-compute the laplacian matrix.

        Parameters
        ----------
        copy : boolean, whether to return copied version of the self.laplacian_matrix
        return_lapsym : boolean, if True returns additionally the symmetrized version of
            the requested laplacian and the re-normalization weights.
        **kwargs : see laplacian.py documentation for arguments for each method.

        Returns
        -------
        self.laplacian_matrix : sparse matrix (N_obs, N_obs).
            The requested laplacian.
        self.laplacian_symmetric : sparse matrix (N_obs, N_obs)
            The symmetric laplacian.
        self.laplacian_weights : ndarray (N_obs,)
            The renormalization weights used to make
            laplacian_matrix from laplacian_symmetric
        """
        if self.affinity_matrix is None:
            self.compute_affinity_matrix()

        kwds = self.laplacian_kwds.copy()
        kwds.update(kwargs)
        kwds['full_output'] = return_lapsym
        result = compute_laplacian_matrix(self.affinity_matrix,
                                          self.laplacian_method,
                                          **kwds)
        if return_lapsym:
            (self.laplacian_matrix,
             self.laplacian_symmetric,
             self.laplacian_weights) = result
        else:
            self.laplacian_matrix = result

        if copy:
            return self.laplacian_matrix.copy()
        else:
            return self.laplacian_matrix

    def set_data_matrix(self, X):
        """
        Set the data matrix.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The original data set to input.
        """
        #X = check_array(X, accept_sparse=sparse_formats)
        self.X = X

    def set_adjacency_matrix(self, adjacency_mat):
        """
        Set the adjacency matrix.

        Parameters
        ----------
        adjacency_mat : sparse matrix, shape (n_samples, n_samples)
            The adjacency matrix to input.
        """
        #adjacency_mat = check_array(adjacency_mat, accept_sparse=sparse_formats)
        if adjacency_mat.shape[0] != adjacency_mat.shape[1]:
            raise ValueError("adjacency matrix is not square")
        self.adjacency_matrix = adjacency_mat

    def set_affinity_matrix(self, affinity_mat):
        """
        Set the affinity matrix.

        Parameters
        ----------
        affinity_mat : sparse matrix (N_obs, N_obs).
            The affinity matrix to input.
        """
        #affinity_mat = check_array(affinity_mat, accept_sparse=sparse_formats)
        if affinity_mat.shape[0] != affinity_mat.shape[1]:
            raise ValueError("affinity matrix is not square")
        self.affinity_matrix = affinity_mat

    def set_laplacian_matrix(self, laplacian_mat):
        """
        Set the Laplacian matrix.

        Parameters
        ----------
        laplacian_mat : sparse matrix (N_obs, N_obs).
            The Laplacian matrix to input.
        """
        #laplacian_mat = check_array(laplacian_mat, accept_sparse = sparse_formats)
        if laplacian_mat.shape[0] != laplacian_mat.shape[1]:
            raise ValueError("Laplacian matrix is not square")
        self.laplacian_matrix = laplacian_mat

    def delete_data_matrix(self):
        """Delete the data matrix from the Geometry object."""
        self.X = None

    def delete_adjacency_matrix(self):
        """Delete the adjacency matrix from the Geometry object."""
        self.adjacency_matrix = None

    def delete_affinity_matrix(self):
        """Delete the affinity matrix from the Geometry object."""
        self.affinity_matrix = None

    def delete_laplacian_matrix(self):
        """Delete the Laplacian matrix from the Geometry object."""
        self.laplacian_matrix = None

compute_adjacency_matrix(copy=False, **kwargs)

This function will compute the adjacency matrix. In order to acquire the existing adjacency matrix use self.adjacency_matrix as compute_adjacency_matrix() will re-compute the adjacency matrix.
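
The difference between reading the stored attribute and calling the compute method can be sketched with a toy stand-in. ToyGeometry below is hypothetical and uses plain lists instead of sparse matrices; it only mimics the recompute-on-call pattern and the copy flag described above.

```python
class ToyGeometry:
    """Hypothetical stand-in that mimics the compute-and-store pattern."""

    def __init__(self, X):
        self.X = X
        self.adjacency_matrix = None
        self.n_computations = 0  # counts how often the matrix is rebuilt

    def compute_adjacency_matrix(self, copy=False):
        if self.X is None:
            raise ValueError("no data matrix set")
        self.n_computations += 1
        # toy "matrix": absolute differences between 1-D points
        self.adjacency_matrix = [[abs(a - b) for b in self.X] for a in self.X]
        if copy:
            return [row[:] for row in self.adjacency_matrix]
        return self.adjacency_matrix


geom = ToyGeometry([0.0, 1.0, 3.0])
first = geom.compute_adjacency_matrix()
cached = geom.adjacency_matrix            # attribute access: no recomputation
again = geom.compute_adjacency_matrix()   # rebuilds and replaces the stored matrix
snapshot = geom.compute_adjacency_matrix(copy=True)
snapshot[0][1] = -1.0                     # edits the copy, not the stored matrix
```

Reading the attribute is the cheap path; each method call rebuilds the matrix, and copy=True returns an object that can be modified without touching the stored one.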

Parameters:

Name Type Description Default
copy boolean, whether to return a copied version of the adjacency matrix
False
**kwargs see distance.py documentation for arguments for each method.
{}

Returns:

Type Description
self.adjacency_matrix : sparse matrix (N_obs, N_obs)

Non explicit 0.0 values should be considered not connected.
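
One reading of this note is that only entries actually stored in the sparse matrix represent edges, so a stored 0.0 (for example a point's distance to itself) differs from an absent entry. The sketch below illustrates that reading with a dict-of-dicts graph. The function names are hypothetical and the brute-force search is an assumption, not the library's method.

```python
import math


def radius_neighbors(points, radius):
    """Brute-force radius graph stored sparsely as {i: {j: distance}}."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            d = math.dist(p, q)
            if d <= radius:
                graph[i][j] = d  # the diagonal is stored as an explicit 0.0
    return graph


def is_connected(graph, i, j):
    """An edge exists only if the entry is stored, whatever its value."""
    return j in graph[i]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
graph = radius_neighbors(points, radius=1.5)
```

Here graph[0][0] is a stored 0.0, while the pair (0, 2) has no entry at all and is treated as not connected.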

Source code in distortions/geometry/geometry.py
def compute_adjacency_matrix(self, copy=False, **kwargs):
    """
    This function will compute the adjacency matrix.
    In order to acquire the existing adjacency matrix use
    self.adjacency_matrix as compute_adjacency_matrix() will re-compute
    the adjacency matrix.

    Parameters
    ----------
    copy : boolean, whether to return a copied version of the adjacency matrix
    **kwargs : see distance.py documentation for arguments for each method.

    Returns
    -------
    self.adjacency_matrix : sparse matrix (N_obs, N_obs)
        Non explicit 0.0 values should be considered not connected.
    """
    if self.X is None:
        raise ValueError(distance_error_msg)

    kwds = self.adjacency_kwds.copy()
    kwds.update(kwargs)
    self.adjacency_matrix = compute_adjacency_matrix(self.X,
                                                     self.adjacency_method,
                                                     **kwds)
    if copy:
        return self.adjacency_matrix.copy()
    else:
        return self.adjacency_matrix

compute_affinity_matrix(copy=False, **kwargs)

This function will compute the affinity matrix. In order to acquire the existing affinity matrix use self.affinity_matrix as compute_affinity_matrix() will re-compute the affinity matrix.

Parameters:

Name Type Description Default
copy boolean

whether to return a copied version of the affinity matrix

False
**kwargs

see affinity.py documentation for arguments for each method.

{}

Returns:

Type Description
self.affinity_matrix : sparse matrix (N_obs, N_obs)

contains the pairwise affinity values using the Gaussian kernel and bandwidth equal to the affinity_radius
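
A Gaussian kernel turns each stored distance into an affinity that decays with distance. The sketch below assumes the form exp(-d^2 / radius^2). The exact scaling used by the library (for instance, whether the denominator carries a factor of 2) is not stated here, and gaussian_affinity is a hypothetical name.

```python
import math


def gaussian_affinity(distances, radius):
    """Map stored distances {i: {j: d}} to affinities exp(-d**2 / radius**2).

    The kernel scaling is an assumption made for illustration.
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    return {
        i: {j: math.exp(-(d * d) / (radius * radius)) for j, d in row.items()}
        for i, row in distances.items()
    }


distances = {0: {0: 0.0, 1: 1.0}, 1: {0: 1.0, 1: 0.0}}
affinity = gaussian_affinity(distances, radius=1.0)
```

Only stored entries are transformed, so the sparsity pattern of the adjacency matrix carries over to the affinity matrix.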

Source code in distortions/geometry/geometry.py
def compute_affinity_matrix(self, copy=False, **kwargs):
    """
    This function will compute the affinity matrix. In order to
    acquire the existing affinity matrix use self.affinity_matrix as
    compute_affinity_matrix() will re-compute the affinity matrix.

    Parameters
    ----------
    copy : boolean
        whether to return a copied version of the affinity matrix
    **kwargs :
        see affinity.py documentation for arguments for each method.

    Returns
    -------
    self.affinity_matrix : sparse matrix (N_obs, N_obs)
        contains the pairwise affinity values using the Gaussian kernel
        and bandwidth equal to the affinity_radius
    """
    if self.adjacency_matrix is None:
        self.compute_adjacency_matrix()

    kwds = self.affinity_kwds.copy()
    kwds.update(kwargs)
    self.affinity_matrix = compute_affinity_matrix(self.adjacency_matrix,
                                                   self.affinity_method,
                                                   **kwds)
    if copy:
        return self.affinity_matrix.copy()
    else:
        return self.affinity_matrix

compute_laplacian_matrix(copy=True, return_lapsym=False, **kwargs)

Note: this function will compute the laplacian matrix. In order to acquire the existing laplacian matrix use self.laplacian_matrix as compute_laplacian_matrix() will re-compute the laplacian matrix.

Parameters:

Name Type Description Default
copy boolean, whether to return copied version of the self.laplacian_matrix
True
return_lapsym boolean, if True returns additionally the symmetrized version of

the requested laplacian and the re-normalization weights.

False
**kwargs see laplacian.py documentation for arguments for each method.
{}

Returns:

Type Description
self.laplacian_matrix : sparse matrix (N_obs, N_obs).

The requested laplacian.

self.laplacian_symmetric : sparse matrix (N_obs, N_obs)

The symmetric laplacian.

self.laplacian_weights : ndarray (N_obs,)

The renormalization weights used to make laplacian_matrix from laplacian_symmetric
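
The relationship between a symmetric Laplacian and a reweighted, non-symmetric one can be illustrated with textbook definitions. The sketch below uses L = D - W, L_rw = I - D^-1 W and L_sym = I - D^-1/2 W D^-1/2 on a small dense affinity matrix. These conventions, the dense lists and the function name are assumptions; the library's sign and scaling choices for each laplacian_method are documented in laplacian.py, not here.

```python
import math


def textbook_laplacians(W):
    """Return (L_un, L_rw, L_sym, degrees) for a dense symmetric affinity W."""
    n = len(W)
    deg = [sum(row) for row in W]
    if any(d <= 0 for d in deg):
        raise ValueError("every node needs positive degree")

    def eye(i, j):
        return 1.0 if i == j else 0.0

    L_un = [[deg[i] * eye(i, j) - W[i][j] for j in range(n)] for i in range(n)]
    L_rw = [[eye(i, j) - W[i][j] / deg[i] for j in range(n)] for i in range(n)]
    L_sym = [[eye(i, j) - W[i][j] / math.sqrt(deg[i] * deg[j])
              for j in range(n)] for i in range(n)]
    return L_un, L_rw, L_sym, deg


W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
L_un, L_rw, L_sym, deg = textbook_laplacians(W)
```

In this convention the random-walk Laplacian is recovered from the symmetric one by degree-based weights, L_rw[i][j] = L_sym[i][j] * sqrt(deg[j] / deg[i]), which is the kind of role the documented renormalization weights play.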

Source code in distortions/geometry/geometry.py
def compute_laplacian_matrix(self, copy=True, return_lapsym=False, **kwargs):
    """
    Note: this function will compute the laplacian matrix. In order to acquire
        the existing laplacian matrix use self.laplacian_matrix as
        compute_laplacian_matrix() will re-compute the laplacian matrix.

    Parameters
    ----------
    copy : boolean, whether to return copied version of the self.laplacian_matrix
    return_lapsym : boolean, if True returns additionally the symmetrized version of
        the requested laplacian and the re-normalization weights.
    **kwargs : see laplacian.py documentation for arguments for each method.

    Returns
    -------
    self.laplacian_matrix : sparse matrix (N_obs, N_obs).
        The requested laplacian.
    self.laplacian_symmetric : sparse matrix (N_obs, N_obs)
        The symmetric laplacian.
    self.laplacian_weights : ndarray (N_obs,)
        The renormalization weights used to make
        laplacian_matrix from laplacian_symmetric
    """
    if self.affinity_matrix is None:
        self.compute_affinity_matrix()

    kwds = self.laplacian_kwds.copy()
    kwds.update(kwargs)
    kwds['full_output'] = return_lapsym
    result = compute_laplacian_matrix(self.affinity_matrix,
                                      self.laplacian_method,
                                      **kwds)
    if return_lapsym:
        (self.laplacian_matrix,
         self.laplacian_symmetric,
         self.laplacian_weights) = result
    else:
        self.laplacian_matrix = result

    if copy:
        return self.laplacian_matrix.copy()
    else:
        return self.laplacian_matrix

delete_adjacency_matrix()

Delete the adjacency matrix from the Geometry object.

Source code in distortions/geometry/geometry.py
def delete_adjacency_matrix(self):
    """Delete the adjacency matrix from the Geometry object."""
    self.adjacency_matrix = None

delete_affinity_matrix()

Delete the affinity matrix from the Geometry object.

Source code in distortions/geometry/geometry.py
def delete_affinity_matrix(self):
    """Delete the affinity matrix from the Geometry object."""
    self.affinity_matrix = None

delete_data_matrix()

Delete the data matrix from the Geometry object.

Source code in distortions/geometry/geometry.py
def delete_data_matrix(self):
    """Delete the data matrix from the Geometry object."""
    self.X = None

delete_laplacian_matrix()

Delete the Laplacian matrix from the Geometry object.

Source code in distortions/geometry/geometry.py
def delete_laplacian_matrix(self):
    """Delete the Laplacian matrix from the Geometry object."""
    self.laplacian_matrix = None

set_adjacency_matrix(adjacency_mat)

Set the adjacency matrix.

Parameters:

Name Type Description Default
adjacency_mat sparse matrix, shape (n_samples, n_samples)

The adjacency matrix to input.

required
Source code in distortions/geometry/geometry.py
def set_adjacency_matrix(self, adjacency_mat):
    """
    Set the adjacency matrix.

    Parameters
    ----------
    adjacency_mat : sparse matrix, shape (n_samples, n_samples)
        The adjacency matrix to input.
    """
    #adjacency_mat = check_array(adjacency_mat, accept_sparse=sparse_formats)
    if adjacency_mat.shape[0] != adjacency_mat.shape[1]:
        raise ValueError("adjacency matrix is not square")
    self.adjacency_matrix = adjacency_mat

set_affinity_matrix(affinity_mat)

Set the affinity matrix.

Parameters:

Name Type Description Default
affinity_mat sparse matrix (N_obs, N_obs).

The affinity matrix to input.

required
Source code in distortions/geometry/geometry.py
def set_affinity_matrix(self, affinity_mat):
    """
    Set the affinity matrix.

    Parameters
    ----------
    affinity_mat : sparse matrix (N_obs, N_obs).
        The affinity matrix to input.
    """
    #affinity_mat = check_array(affinity_mat, accept_sparse=sparse_formats)
    if affinity_mat.shape[0] != affinity_mat.shape[1]:
        raise ValueError("affinity matrix is not square")
    self.affinity_matrix = affinity_mat

set_data_matrix(X)

Set the data matrix.

Parameters:

Name Type Description Default
X array-like, shape (n_samples, n_features)

The original data set to input.

required
Source code in distortions/geometry/geometry.py
def set_data_matrix(self, X):
    """
    Set the data matrix.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The original data set to input.
    """
    #X = check_array(X, accept_sparse=sparse_formats)
    self.X = X

set_laplacian_matrix(laplacian_mat)

Set the Laplacian matrix.

Parameters:

Name Type Description Default
laplacian_mat sparse matrix (N_obs, N_obs).

The Laplacian matrix to input.

required
Source code in distortions/geometry/geometry.py
def set_laplacian_matrix(self, laplacian_mat):
    """
    Set the Laplacian matrix.

    Parameters
    ----------
    laplacian_mat : sparse matrix (N_obs, N_obs).
        The Laplacian matrix to input.
    """
    #laplacian_mat = check_array(laplacian_mat, accept_sparse = sparse_formats)
    if laplacian_mat.shape[0] != laplacian_mat.shape[1]:
        raise ValueError("Laplacian matrix is not square")
    self.laplacian_matrix = laplacian_mat

set_matrix(X, input_type)

Set the data matrix given the input type.

Parameters:

Name Type Description Default
X array-like

Input matrix to set.

required
input_type str

Type of matrix to set. Options: {'data', 'adjacency', 'affinity'}

required
Source code in distortions/geometry/geometry.py
def set_matrix(self, X, input_type):
    """
    Set the data matrix given the input type.

    Parameters
    ----------
    X : array-like
        Input matrix to set.
    input_type : str
        Type of matrix to set. Options: {'data', 'adjacency', 'affinity'}
    """
    if input_type == 'data':
        self.set_data_matrix(X)
    elif input_type == 'adjacency':
        self.set_adjacency_matrix(X)
    elif input_type == 'affinity':
        self.set_affinity_matrix(X)
    else:
        raise ValueError("Unrecognized input_type: {0}".format(input_type))

set_radius(radius, override=True, X=None, n_components=2)

Set the radius for the adjacency and affinity computation

By default, this will override keyword arguments provided on initialization.

Parameters:

Name Type Description Default
radius float

radius to set for adjacency and affinity.

required
override bool (default: True)

if False, then only set radius if not already defined in adjacency_kwds and affinity_kwds.

True
X ndarray or sparse (optional)

if provided, estimate a suitable radius from this data.

None
n_components int (default=2)

the number of components to use when estimating the radius

2
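
The effect of the override flag can be sketched on plain dictionaries. The function below is a hypothetical standalone version of the logic in the source listing for this method; apply_radius is not a library function.

```python
def apply_radius(adjacency_kwds, affinity_kwds, radius, override=True):
    """Write `radius` into both keyword dictionaries, honoring `override`."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    # With override=False, an existing 'radius' or 'n_neighbors' setting
    # for the adjacency graph is left alone.
    if override or ("radius" not in adjacency_kwds
                    and "n_neighbors" not in adjacency_kwds):
        adjacency_kwds["radius"] = radius
    if override or "radius" not in affinity_kwds:
        affinity_kwds["radius"] = radius


# override=False keeps the k-nearest-neighbor setting for adjacency
adj, aff = {"n_neighbors": 10}, {}
apply_radius(adj, aff, 0.5, override=False)

# the default override=True writes the radius everywhere
adj2, aff2 = {"n_neighbors": 10}, {"radius": 2.0}
apply_radius(adj2, aff2, 0.5)
```

With override=False, an adjacency configuration that already specifies n_neighbors is not given a radius, while the affinity radius is still filled in when missing.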
Source code in distortions/geometry/geometry.py
def set_radius(self, radius, override=True, X=None, n_components=2):
    """Set the radius for the adjacency and affinity computation

    By default, this will override keyword arguments provided on
    initialization.

    Parameters
    ----------
    radius : float
        radius to set for adjacency and affinity.
    override : bool (default: True)
        if False, then only set radius if not already defined in
        `adjacency_kwds` and `affinity_kwds`.
    X : ndarray or sparse (optional)
        if provided, estimate a suitable radius from this data.
    n_components : int (default=2)
        the number of components to use when estimating the radius
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")

    if override or ('radius' not in self.adjacency_kwds and
                    'n_neighbors' not in self.adjacency_kwds):
        self.adjacency_kwds['radius'] = radius

    if override or ('radius' not in self.affinity_kwds):
        self.affinity_kwds['radius'] = radius

bind_metric(embedding, Hvv, Hs)

Combine embedding coordinates with local Riemannian metric information.

Parameters:

Name Type Description Default
embedding ndarray, shape (n_samples, n_embedding_dims)

The low-dimensional embedding of the data. This should be the same array as the embedding argument passed to local_distortions.

required
Hvv ndarray, shape (n_samples, n_embedding_dims, n_embedding_dims)

The singular vectors of the dual Riemannian metric tensor for each sample, as returned by local_distortions.

required
Hs ndarray, shape (n_samples, n_embedding_dims)

The singular values of the dual Riemannian metric tensor for each sample, as returned by local_distortions.

required

Returns:

Name Type Description
combined DataFrame

A DataFrame containing the embedding coordinates, the singular vectors and singular values of the local dual Riemannian metric for each sample, and an additional column "angle" computed from the first two singular vector components.

Notes

This function is intended to facilitate analysis and visualization by merging the embedding and local metric information into a single tabular structure.
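
The "angle" column is described as an arctangent of a ratio of two singular-vector components, converted to degrees. The sketch below shows that conversion on plain floats. angle_degrees is a hypothetical helper, and unlike the NumPy version it raises ZeroDivisionError when x is 0.

```python
import math


def angle_degrees(x, y):
    """Angle of the direction (x, y) via arctan(y / x), in degrees.

    The result lies in (-90, 90) and is unchanged when the vector is
    negated, which suits singular vectors (defined only up to sign).
    """
    return math.degrees(math.atan(y / x))


a = angle_degrees(1.0, 1.0)      # direction along the diagonal
b = angle_degrees(-1.0, -1.0)    # same axis, opposite sign
c = angle_degrees(1.0, -1.0)     # anti-diagonal
```

Because arctan of a ratio ignores a joint sign flip, a and b are equal, so the column describes the orientation of an axis rather than a signed direction.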

Source code in distortions/geometry/rmetric.py
def bind_metric(embedding, Hvv, Hs):
    """
    Combine embedding coordinates with local Riemannian metric information.

    Parameters
    ----------
    embedding : np.ndarray, shape (n_samples, n_embedding_dims)
        The low-dimensional embedding of the data. This should be the same array
        as the `embedding` argument passed to `local_distortions`.
    Hvv : np.ndarray, shape (n_samples, n_embedding_dims, n_embedding_dims)
        The singular vectors of the dual Riemannian metric tensor for each sample,
        as returned by `local_distortions`.
    Hs : np.ndarray, shape (n_samples, n_embedding_dims)
        The singular values of the dual Riemannian metric tensor for each sample,
        as returned by `local_distortions`.

    Returns
    -------
    combined : pd.DataFrame
        A DataFrame containing the embedding coordinates, the singular vectors and
        singular values of the local dual Riemannian metric for each sample, and
        an additional column "angle" computed from the first two singular vector
        components.

    Notes
    -----
    This function is intended to facilitate analysis and visualization by merging
    the embedding and local metric information into a single tabular structure.
    """
    K = embedding.shape[1]
    Hvv_df = pd.concat([arrays_to_df(Hvv), arrays_to_df(Hs)], axis=1)
    embedding_df = pd.DataFrame(embedding, columns=[f"embedding_{i}" for i in range(K)])
    embedding_df = embedding_df.reset_index(drop=True)
    Hvv_df = Hvv_df.reset_index(drop=True)

    # merge the embedding and metric data
    combined = pd.concat([embedding_df, Hvv_df], axis=1)
    metric_columns = sum([[f"x{i}", f"y{i}"] for i in range(K)], []) + [f"s{i}" for i in range(K)]
    combined.columns = list(embedding_df.columns) + metric_columns
    combined["angle"] = np.arctan(combined.y1 / combined.x1) * (180 / np.pi)
    return combined

boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs)

Compute boxplot statistics and identify outliers within distance bins.

This function divides the x-values (typically true distances) into bins and computes boxplot statistics for the y-values (typically embedding distances) within each bin. It identifies outliers using the IQR method.

Parameters:

Name Type Description Default
x array-like

Input values used for binning (typically true/original distances).

required
y array-like

Target values for which to compute statistics (typically embedding distances).

required
nbin int

Number of bins to divide the x-value range into.

10
outlier_iqr float

IQR multiplier for outlier detection. Values beyond Q1 - outlier_iqr*IQR or Q3 + outlier_iqr*IQR within each bin are considered outliers.

3
**kwargs keyword arguments

Additional keyword arguments (currently unused).

{}

Returns:

Name Type Description
summaries DataFrame

DataFrame with boxplot statistics for each bin, containing columns:
- 'bin_id': bin identifier
- 'q1', 'q2', 'q3': quartile values
- 'min', 'max': minimum and maximum values
- 'iqr': interquartile range
- 'lower', 'upper': outlier detection bounds
- 'bin': string representation of bin range

outliers DataFrame

DataFrame with outlier information, containing columns:
- 'index': original index of outlier point
- 'bin_id': which bin the outlier belongs to
- 'bin': string representation of bin range
- 'value': the outlier y-value
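
The binning and IQR rule can be sketched without pandas. The version below is a simplified, hypothetical reimplementation: it uses equal-width, left-closed bins (edge handling differs from pd.cut), a hand-written linear-interpolation percentile, and returns only the outlier records.

```python
def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def binned_outliers(x, y, nbin=10, outlier_iqr=3):
    """Flag y-values outside [Q1 - k*IQR, Q3 + k*IQR] within each x-bin."""
    x_min, x_max = min(x), max(x)
    if x_max == x_min:
        raise ValueError("x must span a non-zero range")
    width = (x_max - x_min) / nbin
    # equal-width bins; the largest x is placed in the last bin
    bin_ids = [min(int((xi - x_min) / width), nbin - 1) for xi in x]

    by_bin = {}
    for b, yi in zip(bin_ids, y):
        by_bin.setdefault(b, []).append(yi)

    fences = {}
    for b, vals in by_bin.items():
        q1, q3 = percentile(vals, 25), percentile(vals, 75)
        iqr = q3 - q1
        fences[b] = (q1 - outlier_iqr * iqr, q3 + outlier_iqr * iqr)

    return [
        {"index": i, "bin_id": b, "value": yi}
        for i, (b, yi) in enumerate(zip(bin_ids, y))
        if yi < fences[b][0] or yi > fences[b][1]
    ]


x = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.0, 1.1, 0.9, 1.0, 9.0, 2.0, 2.1, 1.9, 2.0, 2.05]
outliers = binned_outliers(x, y, nbin=2, outlier_iqr=3)
```

In this data the fifth point (index 4) sits far above the other embedding distances in its bin and is the only value flagged.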

Source code in distortions/geometry/neighborhoods.py
def boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs):
    """
    Compute boxplot statistics and identify outliers within distance bins.

    This function divides the x-values (typically true distances) into bins and
    computes boxplot statistics for the y-values (typically embedding distances)
    within each bin. It identifies outliers using the IQR method.

    Parameters
    ----------
    x : array-like
        Input values used for binning (typically true/original distances).
    y : array-like
        Target values for which to compute statistics (typically embedding distances).
    nbin : int, default=10
        Number of bins to divide the x-value range into.
    outlier_iqr : float, default=3
        IQR multiplier for outlier detection. Values beyond Q1 - outlier_iqr*IQR
        or Q3 + outlier_iqr*IQR within each bin are considered outliers.
    **kwargs : keyword arguments
        Additional keyword arguments (currently unused).

    Returns
    -------
    summaries : pd.DataFrame
        DataFrame with boxplot statistics for each bin containing columns:
        - 'bin_id': bin identifier
        - 'q1', 'q2', 'q3': quartile values
        - 'min', 'max': minimum and maximum values
        - 'iqr': interquartile range
        - 'lower', 'upper': outlier detection bounds
        - 'bin': string representation of bin range
    outliers : pd.DataFrame  
        DataFrame with outlier information containing columns:
        - 'index': original index of outlier point
        - 'bin_id': which bin the outlier belongs to
        - 'bin': string representation of bin range
        - 'value': the outlier y-value
    """
    # divide the data into nbin groups, and compute quantiles in each
    bin_ids, bin_edges = pd.cut(x, bins=nbin, labels=False, retbins=True)
    bin_edges = np.round(bin_edges, 1)

    summaries = (
        pd.DataFrame({'bin_id': bin_ids, 'y': y})
        .groupby('bin_id', as_index=False)['y']
        .agg(q1=lambda v: np.percentile(v, 25),
             q2=lambda v: np.percentile(v, 50),
             q3=lambda v: np.percentile(v, 75),
             min='min', max='max')
    )
    summaries['iqr'] = summaries.q3 - summaries.q1
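    # outlier fences built from Q1/Q3 (matching the docstring and the outlier
    # test below), clipped to the observed range of each bin
    summaries['lower'] = np.maximum(summaries.q1 - outlier_iqr * summaries.iqr, summaries['min'])
    summaries['upper'] = np.minimum(summaries.q3 + outlier_iqr * summaries.iqr, summaries['max'])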
    summaries['bin'] = summaries['bin_id'].map(lambda b: f"{bin_edges[b]}-{bin_edges[b + 1]}")

    # compute outliers according to the IQR above; index the statistics by bin_id
    # so lookups stay correct when empty bins are missing from summaries
    stats = summaries.set_index('bin_id')
    outliers = [
        {"index": i, "bin_id": int(b), "bin": f"{bin_edges[int(b)]}-{bin_edges[int(b) + 1]}", "value": val}
        for i, (b, val) in enumerate(zip(bin_ids, y))
        if not np.isnan(b) and (
            val < stats.loc[b, 'q1'] - outlier_iqr * stats.loc[b, 'iqr'] or
            val > stats.loc[b, 'q3'] + outlier_iqr * stats.loc[b, 'iqr']
        )
    ]
    return summaries, pd.DataFrame(outliers)

local_distortions(embedding, data, geom)

Compute local Riemannian metric distortions for each sample.

Parameters:

Name Type Description Default
embedding ndarray, shape (n_samples, n_embedding_dims)

Low-dimensional embedding of the data. Each row corresponds to a sample, and each column corresponds to an embedding dimension.

required
data ndarray, shape (n_samples, n_features)

Original high-dimensional data. Each row is a sample, each column a feature.

required
geom Geometry

An instance of the Geometry class (from geometry.py) that provides methods for setting the data matrix and computing the Laplacian matrix.

required

Returns:

Name Type Description
H ndarray

Dual Riemannian metric tensor for each sample.

Hvv ndarray

Singular vectors of the dual metric tensor for each sample.

Hs ndarray

Singular values of the dual metric tensor for each sample.

Notes

This function sets the data matrix in the provided Geometry object, computes the Laplacian matrix, and then estimates the local Riemannian metric distortions in the embedding space using the original data.

Source code in distortions/geometry/rmetric.py
def local_distortions(embedding, data, geom):
    """
    Compute local Riemannian metric distortions for each sample.

    Parameters
    ----------
    embedding : np.ndarray, shape (n_samples, n_embedding_dims)
        Low-dimensional embedding of the data. Each row corresponds to a sample,
        and each column corresponds to an embedding dimension.
    data : np.ndarray, shape (n_samples, n_features)
        Original high-dimensional data. Each row is a sample, each column a feature.
    geom : Geometry
        An instance of the Geometry class (from geometry.py) that provides
        methods for setting the data matrix and computing the Laplacian matrix.

    Returns
    -------
    H : np.ndarray
        Dual Riemannian metric tensor for each sample.
    Hvv : np.ndarray
        Singular vectors of the dual metric tensor for each sample.
    Hs : np.ndarray
        Singular values of the dual metric tensor for each sample.

    Notes
    -----
    This function sets the data matrix in the provided Geometry object,
    computes the Laplacian matrix, and then estimates the local Riemannian
    metric distortions in the embedding space using the original data.
    """
    geom.set_data_matrix(data)
    L = geom.compute_laplacian_matrix()
    _, _, Hvv, Hs, _, H = riemann_metric(embedding, L, n_dim=2)
    return H, Hvv, Hs
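
As a rough illustration of what the returned singular values convey, the sketch below decomposes two hand-written 2x2 matrices with NumPy. The array `H_toy`, its shape `(n_samples, 2, 2)`, and the `anisotropy` ratio are assumptions made for this example; they are not produced by `local_distortions` itself.

```python
import numpy as np

# Hypothetical stack of symmetric 2x2 matrices, one per sample.
H_toy = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # isotropic: equal stretch in every direction
    [[4.0, 0.0], [0.0, 1.0]],  # stretched along the first axis
])

# np.linalg.svd broadcasts over the leading (sample) axis and returns
# singular values in descending order for each matrix.
singular_vectors, singular_values, _ = np.linalg.svd(H_toy)

# Ratio of largest to smallest singular value; 1.0 means no preferred direction.
anisotropy = singular_values[:, 0] / singular_values[:, 1]
print(anisotropy)
```

A ratio close to 1 suggests the neighborhood is distorted evenly in all directions, while a large ratio points to stretching along one direction.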

neighborhood_distances(adata, embed_key='X_umap')

Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

This function calculates pairwise distances between each sample and its neighbors in the original high-dimensional space and compares them with distances in the reduced embedding space. This is useful for analyzing how well the embedding preserves local neighborhood structure.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in obsm[embed_key] and a neighbor graph in obsp["distances"].

required
embed_key str

Key in adata.obsm where the embedding coordinates are stored.

"X_umap"

Returns:

Type Description
DataFrame

DataFrame with columns:
- 'center': index of the sample (cell)
- 'neighbor': index of the neighbor sample
- 'true': distance in the original space (from adata.obsp["distances"])
- 'embedding': distance in the embedding space (from adata.obsm[embed_key])

Notes

The number of neighbors is determined by the structure of the neighbor graph in adata.obsp["distances"]. The function assumes that the embedding and neighbor graph have already been computed.

Source code in distortions/geometry/neighborhoods.py
def neighborhood_distances(adata, embed_key="X_umap"):
    """
    Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

    This function calculates pairwise distances between each sample and its
    neighbors in the original high-dimensional space and compares them with
    distances in the reduced embedding space. This is useful for analyzing
    how well the embedding preserves local neighborhood structure.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in `obsm[embed_key]`
        and a neighbor graph in `obsp["distances"]`.
    embed_key : str, default="X_umap"
        Key in `adata.obsm` where the embedding coordinates are stored.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
            - 'center': index of the sample (cell)
            - 'neighbor': index of the neighbor sample
            - 'true': distance in the original space (from `adata.obsp["distances"]`)
            - 'embedding': distance in the embedding space (from `adata.obsm[embed_key]`)

    Notes
    -----
    The number of neighbors is determined by the structure of the neighbor graph in `adata.obsp["distances"]`.
    The function assumes that the embedding and neighbor graph have already been computed.
    """
    knn_graph = adata.obsp["distances"]
    dist_list = []

    for ix in range(len(adata)):
        neighbors = knn_graph[ix].nonzero()[1]
        true = knn_graph[ix, neighbors].toarray().flatten()
        embedding = cdist(
            [adata.obsm[embed_key][ix, :]], 
            adata.obsm[embed_key][neighbors, :]
        ).flatten()
        dist_list.append(pd.DataFrame({
            "center": [ix] * len(neighbors), 
            "neighbor": neighbors,
            "true": true,
            "embedding": embedding
        }))

    return pd.concat(dist_list)

neighborhoods

boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs)

Compute boxplot statistics and identify outliers within distance bins.

This function divides the x-values (typically true distances) into bins and computes boxplot statistics for the y-values (typically embedding distances) within each bin. It identifies outliers using the IQR method.

Parameters:

Name Type Description Default
x array-like

Input values used for binning (typically true/original distances).

required
y array-like

Target values for which to compute statistics (typically embedding distances).

required
nbin int

Number of bins to divide the x-value range into.

10
outlier_iqr float

IQR multiplier for outlier detection. Values beyond Q1 - outlier_iqr * IQR or Q3 + outlier_iqr * IQR within each bin are considered outliers.

3
**kwargs keyword arguments

Additional keyword arguments (currently unused).

{}

Returns:

Name Type Description
summaries DataFrame

DataFrame with boxplot statistics for each bin containing columns:
- 'bin_id': bin identifier
- 'q1', 'q2', 'q3': quartile values
- 'min', 'max': minimum and maximum values
- 'iqr': interquartile range
- 'lower', 'upper': outlier detection bounds
- 'bin': string representation of bin range

outliers DataFrame

DataFrame with outlier information containing columns:
- 'index': original index of outlier point
- 'bin_id': which bin the outlier belongs to
- 'bin': string representation of bin range
- 'value': the outlier y-value

Source code in distortions/geometry/neighborhoods.py
def boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs):
    """
    Compute boxplot statistics and identify outliers within distance bins.

    This function divides the x-values (typically true distances) into bins and
    computes boxplot statistics for the y-values (typically embedding distances)
    within each bin. It identifies outliers using the IQR method.

    Parameters
    ----------
    x : array-like
        Input values used for binning (typically true/original distances).
    y : array-like
        Target values for which to compute statistics (typically embedding distances).
    nbin : int, default=10
        Number of bins to divide the x-value range into.
    outlier_iqr : float, default=3
        IQR multiplier for outlier detection. Values beyond Q1 - outlier_iqr*IQR
        or Q3 + outlier_iqr*IQR within each bin are considered outliers.
    **kwargs : keyword arguments
        Additional keyword arguments (currently unused).

    Returns
    -------
    summaries : pd.DataFrame
        DataFrame with boxplot statistics for each bin containing columns:
        - 'bin_id': bin identifier
        - 'q1', 'q2', 'q3': quartile values
        - 'min', 'max': minimum and maximum values
        - 'iqr': interquartile range
        - 'lower', 'upper': outlier detection bounds
        - 'bin': string representation of bin range
    outliers : pd.DataFrame  
        DataFrame with outlier information containing columns:
        - 'index': original index of outlier point
        - 'bin_id': which bin the outlier belongs to
        - 'bin': string representation of bin range
        - 'value': the outlier y-value
    """
    # divide the data into nbin groups, and compute quantiles in each
    bin_ids, bin_edges = pd.cut(x, bins=nbin, labels=False, retbins=True)
    bin_edges = np.round(bin_edges, 1)

    summaries = (
        pd.DataFrame({'bin_id': bin_ids, 'y': y})
        .groupby('bin_id', as_index=False)['y']
        .agg(q1=lambda v: np.percentile(v, 25),
             q2=lambda v: np.percentile(v, 50),
             q3=lambda v: np.percentile(v, 75),
             min='min', max='max')
    )
    summaries['iqr'] = summaries.q3 - summaries.q1
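    # outlier fences built from Q1/Q3 (matching the docstring and the outlier
    # test below), clipped to the observed range of each bin
    summaries['lower'] = np.maximum(summaries.q1 - outlier_iqr * summaries.iqr, summaries['min'])
    summaries['upper'] = np.minimum(summaries.q3 + outlier_iqr * summaries.iqr, summaries['max'])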
    summaries['bin'] = summaries['bin_id'].map(lambda b: f"{bin_edges[b]}-{bin_edges[b + 1]}")

    # compute outliers according to the IQR above; index the statistics by bin_id
    # so lookups stay correct when empty bins are missing from summaries
    stats = summaries.set_index('bin_id')
    outliers = [
        {"index": i, "bin_id": int(b), "bin": f"{bin_edges[int(b)]}-{bin_edges[int(b) + 1]}", "value": val}
        for i, (b, val) in enumerate(zip(bin_ids, y))
        if not np.isnan(b) and (
            val < stats.loc[b, 'q1'] - outlier_iqr * stats.loc[b, 'iqr'] or
            val > stats.loc[b, 'q3'] + outlier_iqr * stats.loc[b, 'iqr']
        )
    ]
    return summaries, pd.DataFrame(outliers)
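
The binning and fence logic described above can be sketched without pandas. This standalone example uses only NumPy; the helper name `bin_outliers` and the toy data are hypothetical, and equal-width edges passed to `np.digitize` only approximate the bins that `pd.cut` would produce.

```python
import numpy as np

def bin_outliers(x, y, nbin=2, factor=3):
    """Return positions of y-values outside the Q1/Q3 fences of their x-bin."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    edges = np.linspace(x.min(), x.max(), nbin + 1)
    # Use interior edges only, with right-closed bins, so the largest x
    # lands in the last bin rather than in an extra overflow bin.
    bin_ids = np.digitize(x, edges[1:-1], right=True)

    flagged = []
    for b in range(nbin):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # skip empty bins
        q1, q3 = np.percentile(y[in_bin], [25, 75])
        iqr = q3 - q1
        low, high = q1 - factor * iqr, q3 + factor * iqr
        hits = np.where(in_bin & ((y < low) | (y > high)))[0]
        flagged.extend(int(i) for i in hits)
    return flagged

x = [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
y = [1.0, 1.1, 0.9, 1.0, 9.0, 2.0, 2.1, 1.9, 2.0, 2.05]
outlier_positions = bin_outliers(x, y)
print(outlier_positions)
```

Only the value 9.0 at position 4 falls outside the fences of its bin, so it is the only position reported.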

broken_knn(embedding, k=2, z_thresh=1.0)

Determine broken points in embedding space using k-NN distances and Z-score thresholding.

This function identifies potentially problematic points in an embedding by computing their average k-nearest neighbor distances, calculating Z-scores, and flagging points that exceed the threshold as broken or isolated.

Parameters:

Name Type Description Default
embedding array-like, shape (n_samples, n_features)

The embedding coordinates for all samples.

required
k int

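Number of nearest neighbors to consider for distance calculation. Because the neighbors are queried on the same points that were fitted, each point counts as its own nearest neighbor (distance 0), so the default k=2 averages over the point itself and its single nearest other point.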

2
z_thresh float

Z-score threshold for identifying broken points. Points with Z-scores greater than or equal to this value are considered broken.

1.0

Returns:

Type Description
list of int

List of indices of broken points, sorted by descending Z-score. If no points exceed the threshold, returns the single point with the highest Z-score.

Source code in distortions/geometry/neighborhoods.py
def broken_knn(embedding, k=2, z_thresh=1.0):
    """
    Determine broken points in embedding space using k-NN distances and Z-score thresholding.

    This function identifies potentially problematic points in an embedding by
    computing their average k-nearest neighbor distances, calculating Z-scores,
    and flagging points that exceed the threshold as broken or isolated.

    Parameters
    ----------
    embedding : array-like, shape (n_samples, n_features)
        The embedding coordinates for all samples.
    k : int, default=2
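        Number of nearest neighbors to consider for distance calculation.
        Each point counts as its own nearest neighbor (distance 0), so the
        default k=2 averages over the point itself and its nearest other point.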
    z_thresh : float, default=1.0
        Z-score threshold for identifying broken points. Points with Z-scores
        greater than or equal to this value are considered broken.

    Returns
    -------
    list of int
        List of indices of broken points, sorted by descending Z-score.
        If no points exceed the threshold, returns the single point with
        the highest Z-score.
    """
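    # 1) mean distance to the k nearest neighbors; the query points are the
    #    fitted points, so each point is returned as its own nearest neighbor
    #    (distance 0) and that zero is included in the mean
    nbr_sub = NearestNeighbors(n_neighbors=k).fit(embedding)
    d_sub, _ = nbr_sub.kneighbors(embedding)
    d1 = d_sub.mean(axis=1)

    # 2) Z-score & threshold
    mu, sigma = d1.mean(), d1.std()
    z = (d1 - mu) / sigma
    locs = np.where(z >= z_thresh)[0]
    if len(locs) == 0:
        locs = [int(np.argmax(z))]

    # 3) rank by descending Z-score and return plain Python ints
    locs = sorted(locs, key=lambda i: z[i], reverse=True)
    return [int(i) for i in locs]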
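
The Z-score thresholding described above can be sketched with NumPy alone. This is an illustrative stand-in, not the library's code: the helper name `isolated_points` is hypothetical, and a brute-force distance matrix replaces scikit-learn's `NearestNeighbors`. As in the listing, each point's zero self-distance is part of the mean.

```python
import numpy as np

def isolated_points(points, k=2, z_thresh=1.0):
    points = np.asarray(points, float)
    # Brute-force pairwise Euclidean distances.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # The k smallest distances per row; the first one is the self-distance 0.
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    z = (score - score.mean()) / score.std()
    hits = np.where(z >= z_thresh)[0]
    if hits.size == 0:
        hits = np.array([np.argmax(z)])  # fall back to the single largest Z-score
    return [int(i) for i in sorted(hits, key=lambda i: z[i], reverse=True)]

# Four points on a unit square plus one far-away point.
pts = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
broken = isolated_points(pts)
print(broken)
```

Note that the Z-score is undefined when every point has the same mean neighbor distance, since the standard deviation is then zero.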

identify_broken_box(dists, outlier_factor=3, nbin=10)

Identify broken links using boxplot-based outlier detection within distance bins.

This helper function bins the true distances and identifies outliers in the embedding distances within each bin using boxplot criteria.

Parameters:

Name Type Description Default
dists DataFrame

DataFrame with 'true' and 'embedding' distance columns.

required
outlier_factor float

IQR multiplier for outlier detection threshold.

3
nbin int

Number of bins to divide the true distance range into.

10

Returns:

Type Description
DataFrame

Copy of input distances DataFrame with additional 'brokenness' boolean column indicating which links are identified as broken outliers.

Source code in distortions/geometry/neighborhoods.py
def identify_broken_box(dists, outlier_factor=3, nbin=10):
    """
    Identify broken links using boxplot-based outlier detection within distance bins.

    This helper function bins the true distances and identifies outliers in the
    embedding distances within each bin using boxplot criteria.

    Parameters
    ----------
    dists : pd.DataFrame
        DataFrame with 'true' and 'embedding' distance columns.
    outlier_factor : float, default=3
        IQR multiplier for outlier detection threshold.
    nbin : int, default=10
        Number of bins to divide the true distance range into.

    Returns
    -------
    pd.DataFrame
        Copy of input distances DataFrame with additional 'brokenness' boolean column
        indicating which links are identified as broken outliers.
    """
    _, outliers = boxplot_data(dists["true"], dists["embedding"], nbin, outlier_factor)
    brokenness = dists.copy()
    brokenness = brokenness.reset_index()
    brokenness["brokenness"] = False
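    # boxplot_data returns an empty, column-less DataFrame when nothing is flagged
    if not outliers.empty:
        brokenness.loc[outliers["index"].values, "brokenness"] = True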
    return brokenness

identify_broken_window(dists, outlier_factor=3, percentiles=[75, 25], frame=[50, 50])

Identify broken links using sliding window smoothing and residual analysis.

This helper function applies a sliding window median filter to the distance relationship and identifies links where the embedding distance significantly exceeds the smoothed expectation.

Parameters:

Name Type Description Default
dists DataFrame

DataFrame with 'true' and 'embedding' distance columns.

required
outlier_factor float

Multiplier for IQR-based outlier threshold in residual analysis.

3
percentiles list of float

Percentiles used for IQR calculation.

[75, 25]
frame list of int

Window frame size [before, after] for sliding median calculation.

[50, 50]

Returns:

Type Description
DataFrame

DataFrame with original columns plus:
- 'embedding_smooth': smoothed embedding distances
- 'residual': difference between actual and smoothed embedding distances
- 'brokenness': boolean indicating broken links

Source code in distortions/geometry/neighborhoods.py
def identify_broken_window(dists, outlier_factor=3, percentiles=[75, 25], frame=[50, 50]):
    """
    Identify broken links using sliding window smoothing and residual analysis.

    This helper function applies a sliding window median filter to the distance
    relationship and identifies links where the embedding distance significantly
    exceeds the smoothed expectation.

    Parameters
    ----------
    dists : pd.DataFrame
        DataFrame with 'true' and 'embedding' distance columns.
    outlier_factor : float, default=3
        Multiplier for IQR-based outlier threshold in residual analysis.
    percentiles : list of float, default=[75, 25]
        Percentiles used for IQR calculation.
    frame : list of int, default=[50, 50]
        Window frame size [before, after] for sliding median calculation.

    Returns
    -------
    pd.DataFrame
        DataFrame with original columns plus:
        - 'embedding_smooth': smoothed embedding distances
        - 'residual': difference between actual and smoothed embedding distances  
        - 'brokenness': boolean indicating broken links
    """
    line = alt.Chart(dists).transform_window(
        embedding_smooth='median(embedding)',
        sort=[alt.SortField('true')],
        frame=frame
    ).mark_line().encode(
        x='true:Q',
        y='embedding_smooth:Q'
    )

    result = extract_data(line).drop_duplicates()
    result["residual"] = result["embedding"] - result["embedding_smooth"]
    result["brokenness"] = result["residual"] > result["embedding_smooth"] + \
        outlier_factor * iqr(result["residual"], percentiles)
    return result
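
To make the smoothing step concrete, here is a minimal NumPy sketch of a sliding-window median followed by a residual. It assumes the `[before, after]` reading of `frame` given in the parameter description; the helper `rolling_median` and the toy distances are hypothetical, and the listing above delegates the window computation to Altair rather than to code like this.

```python
import numpy as np

def rolling_median(values, before, after):
    """Median over a window of `before` rows behind and `after` rows ahead."""
    values = np.asarray(values, float)
    out = np.empty_like(values)
    for i in range(len(values)):
        lo = max(0, i - before)
        hi = min(len(values), i + after + 1)
        out[i] = np.median(values[lo:hi])
    return out

true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
embedding = np.array([1.1, 2.0, 2.9, 12.0, 5.1, 6.0, 7.2])

# Sort links by true distance, smooth the embedding distance, take residuals.
order = np.argsort(true)
smooth = rolling_median(embedding[order], before=1, after=1)
residual = embedding[order] - smooth
print(int(np.argmax(residual)))
```

The link whose embedding distance (12.0) is far above its neighbors in the sorted order yields the largest residual, which is the kind of link the window method is meant to surface.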

iqr(x, percentiles)

Calculate the interquartile range between given percentiles.

This function computes the difference between two percentiles of the input array, typically used to measure the spread of data.

Parameters:

Name Type Description Default
x array-like

Input array for which to calculate the interquartile range.

required
percentiles array-like of length 2

Two percentile values, upper first (e.g., [75, 25] for the standard IQR). The function returns the first percentile minus the second.

required

Returns:

Type Description
float

The interquartile range (difference between the specified percentiles).

Source code in distortions/geometry/neighborhoods.py
def iqr(x, percentiles):
    """
    Calculate the interquartile range between given percentiles.

    This function computes the difference between two percentiles of the
    input array, typically used to measure the spread of data.

    Parameters
    ----------
    x : array-like
        Input array for which to calculate the interquartile range.
    percentiles : array-like of length 2
        Two percentile values, upper first (e.g., [75, 25] for the standard IQR).
        The function returns the first percentile minus the second.

    Returns
    -------
    float
        The interquartile range (difference between the specified percentiles).
    """
    return np.subtract(*np.percentile(x, percentiles))
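
Because the body unpacks the two percentiles straight into `np.subtract`, the result is the first requested percentile minus the second, so the order of `percentiles` matters. The short check below, using a made-up array, shows the effect of both orderings.

```python
import numpy as np

x = np.arange(1, 10)  # the integers 1 through 9

# Upper percentile first: a positive spread.
spread = np.subtract(*np.percentile(x, [75, 25]))

# Lower percentile first: the same magnitude with the sign flipped.
flipped = np.subtract(*np.percentile(x, [25, 75]))

print(spread, flipped)
```

This is consistent with the `[75, 25]` default used by the functions that call `iqr`.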

neighbor_generator(embedding, broken_locations=[], number_neighbor=10)

Generate neighbor lists for broken points in the embedding space.

This function finds nearest neighbors for specified broken points (or automatically detected ones) in the embedding space. It's useful for understanding the local neighborhood structure around problematic points.

Parameters:

Name Type Description Default
embedding array-like, shape (n_samples, n_features)

The embedding coordinates for all samples.

required
broken_locations list of int

Indices of broken points for which to generate neighbors. If empty, automatically detects broken points using broken_knn().

[]
number_neighbor int

Number of nearest neighbors to find for each broken point.

10

Returns:

Type Description
dict

Dictionary mapping broken point indices (int) to lists of their nearest neighbor indices, excluding the point itself.

Source code in distortions/geometry/neighborhoods.py
def neighbor_generator(embedding, broken_locations = [], number_neighbor=10):
    """
    Generate neighbor lists for broken points in the embedding space.

    This function finds nearest neighbors for specified broken points (or
    automatically detected ones) in the embedding space. It's useful for
    understanding the local neighborhood structure around problematic points.

    Parameters
    ----------
    embedding : array-like, shape (n_samples, n_features)
        The embedding coordinates for all samples.
    broken_locations : list of int, default=[]
        Indices of broken points for which to generate neighbors. If empty,
        automatically detects broken points using broken_knn().
    number_neighbor : int, default=10
        Number of nearest neighbors to find for each broken point.

    Returns
    -------
    dict
        Dictionary mapping broken point indices (int) to lists of their 
        nearest neighbor indices, excluding the point itself.
    """
    if len(broken_locations) == 0:
        broken_locations = broken_knn(embedding)
    nbr_full = NearestNeighbors(n_neighbors=number_neighbor+1).fit(embedding)
    isolated = {}
    for idx in broken_locations:
        _, neigh = nbr_full.kneighbors([embedding[idx]])
        isolated[int(idx)] = neigh[0][1:].tolist()  # drop self
    return isolated
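
The shape of the returned dictionary can be illustrated with a brute-force NumPy version. The helper `nearest_neighbors` and the toy coordinates are hypothetical; the listing above uses scikit-learn's `NearestNeighbors` and drops the first neighbor as the point itself, whereas this sketch filters the point out by index.

```python
import numpy as np

def nearest_neighbors(points, targets, n_neighbors=2):
    points = np.asarray(points, float)
    result = {}
    for idx in targets:
        # Euclidean distance from the target point to every point.
        dist = np.linalg.norm(points - points[idx], axis=1)
        order = np.argsort(dist, kind="stable")
        # Keep the closest points, excluding the target itself.
        result[int(idx)] = [int(j) for j in order if j != idx][:n_neighbors]
    return result

pts = [[0, 0], [1, 0], [0, 2], [10, 10]]
neighbors_of_broken = nearest_neighbors(pts, targets=[3])
print(neighbors_of_broken)
```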

neighborhood_distances(adata, embed_key='X_umap')

Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

This function calculates pairwise distances between each sample and its neighbors in the original high-dimensional space and compares them with distances in the reduced embedding space. This is useful for analyzing how well the embedding preserves local neighborhood structure.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in obsm[embed_key] and a neighbor graph in obsp["distances"].

required
embed_key str

Key in adata.obsm where the embedding coordinates are stored.

"X_umap"

Returns:

Type Description
DataFrame

DataFrame with columns:
- 'center': index of the sample (cell)
- 'neighbor': index of the neighbor sample
- 'true': distance in the original space (from adata.obsp["distances"])
- 'embedding': distance in the embedding space (from adata.obsm[embed_key])

Notes

The number of neighbors is determined by the structure of the neighbor graph in adata.obsp["distances"]. The function assumes that the embedding and neighbor graph have already been computed.

Source code in distortions/geometry/neighborhoods.py
def neighborhood_distances(adata, embed_key="X_umap"):
    """
    Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

    This function calculates pairwise distances between each sample and its
    neighbors in the original high-dimensional space and compares them with
    distances in the reduced embedding space. This is useful for analyzing
    how well the embedding preserves local neighborhood structure.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in `obsm[embed_key]`
        and a neighbor graph in `obsp["distances"]`.
    embed_key : str, default="X_umap"
        Key in `adata.obsm` where the embedding coordinates are stored.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
            - 'center': index of the sample (cell)
            - 'neighbor': index of the neighbor sample
            - 'true': distance in the original space (from `adata.obsp["distances"]`)
            - 'embedding': distance in the embedding space (from `adata.obsm[embed_key]`)

    Notes
    -----
    The number of neighbors is determined by the structure of the neighbor graph in `adata.obsp["distances"]`.
    The function assumes that the embedding and neighbor graph have already been computed.
    """
    knn_graph = adata.obsp["distances"]
    dist_list = []

    for ix in range(len(adata)):
        neighbors = knn_graph[ix].nonzero()[1]
        true = knn_graph[ix, neighbors].toarray().flatten()
        embedding = cdist(
            [adata.obsm[embed_key][ix, :]], 
            adata.obsm[embed_key][neighbors, :]
        ).flatten()
        dist_list.append(pd.DataFrame({
            "center": [ix] * len(neighbors), 
            "neighbor": neighbors,
            "true": true,
            "embedding": embedding
        }))

    return pd.concat(dist_list)
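
The (center, neighbor, true, embedding) rows can be illustrated on a tiny example without AnnData. Everything here is an assumption for demonstration purposes: the arrays `X` and `Z`, the helper `pairwise`, and the choice to pick the `k` nearest neighbors from a dense distance matrix stand in for the precomputed sparse graph in `adata.obsp["distances"]`.

```python
import numpy as np

# Toy data: 4 points in 3-D (the "original" space) and a made-up 2-D embedding.
X = np.array([[0., 0., 0.], [1., 0., 0.], [0., 2., 0.], [5., 5., 5.]])
Z = np.array([[0., 0.], [1., 0.], [0., 1.], [3., 3.]])

def pairwise(A):
    """Dense Euclidean distance matrix between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D_true, D_embed = pairwise(X), pairwise(Z)

k = 2
rows = []
for center in range(len(X)):
    # k nearest neighbors in the original space, skipping the point itself.
    order = np.argsort(D_true[center], kind="stable")
    for nb in [j for j in order if j != center][:k]:
        rows.append((center, int(nb),
                     float(D_true[center, nb]), float(D_embed[center, nb])))

for row in rows[:2]:
    print(row)
```

Comparing the third and fourth entries of each row shows how a neighbor's distance changes between the original space and the embedding.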

neighborhoods(adata, outlier_factor=3, threshold=0.2, method='box', percentiles=[75, 25], frame=[50, 50], nbin=10, **kwargs)

Identify broken neighborhoods in embeddings using different methods.

This function serves as the main interface for detecting broken neighborhoods in dimensionality reduction embeddings. It supports multiple methods for identifying outliers and broken links between original and embedding spaces.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix with precomputed embedding and neighbor graph.

required
outlier_factor float

Factor used to determine outlier threshold. Higher values are more permissive (fewer outliers detected).

3
threshold float

Proportion threshold for flagging samples as having broken neighborhoods. Centers with more than this proportion of broken neighbors are flagged.

0.2
method str

Method for identifying broken neighborhoods. Options:
- "box": Uses boxplot-based outlier detection
- "window": Uses sliding window smoothing with residual analysis

"box"
percentiles list of float

Percentiles used for IQR calculation in windowing method.

[75, 25]
frame list of int

Window frame size [before, after] for sliding window smoothing.

[50, 50]
nbin int

Number of bins for boxplot method.

10
**kwargs keyword arguments

Additional arguments passed to neighborhood_distances().

{}

Returns:

Type Description
dict

Dictionary mapping center indices to lists of their neighbor indices for samples with broken neighborhoods.

Raises:

Type Description
NotImplementedError

If an unsupported method is specified.

Source code in distortions/geometry/neighborhoods.py
def neighborhoods(adata, outlier_factor=3, threshold=0.2, method="box",
                  percentiles=[75, 25], frame=[50, 50], nbin=10, **kwargs):
    """
    Identify broken neighborhoods in embeddings using different methods.

    This function serves as the main interface for detecting broken neighborhoods
    in dimensionality reduction embeddings. It supports multiple methods for
    identifying outliers and broken links between original and embedding spaces.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix with precomputed embedding and neighbor graph.
    outlier_factor : float, default=3
        Factor used to determine outlier threshold. Higher values are more
        permissive (fewer outliers detected).
    threshold : float, default=0.2  
        Proportion threshold for flagging samples as having broken neighborhoods.
        Centers with more than this proportion of broken neighbors are flagged.
    method : str, default="box"
        Method for identifying broken neighborhoods. Options:
        - "box": Uses boxplot-based outlier detection
        - "window": Uses sliding window smoothing with residual analysis
    percentiles : list of float, default=[75, 25]
        Percentiles used for IQR calculation in windowing method.
    frame : list of int, default=[50, 50]
        Window frame size [before, after] for sliding window smoothing.
    nbin : int, default=10
        Number of bins for boxplot method.
    **kwargs : keyword arguments
        Additional arguments passed to neighborhood_distances().

    Returns
    -------
    dict
        Dictionary mapping center indices to lists of their neighbor indices
        for samples with broken neighborhoods.

    Raises
    ------
    NotImplementedError
        If an unsupported method is specified.
    """
    if method == "box":
        return neighborhoods_box(adata, outlier_factor, threshold, nbin, **kwargs)
    if method == "window":
        return neighborhoods_window(adata, outlier_factor, threshold, percentiles, frame, **kwargs)
    else:
        raise NotImplementedError(f"Method {method} not implemented for broken neighborhood construction.")
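
The Raises entry above documents that an unsupported `method` produces a `NotImplementedError`. A minimal sketch of that dispatch contract follows; the function `detect` and its placeholder handlers are hypothetical and stand in for the real box and window routines.

```python
def detect(method="box"):
    # Placeholder handlers standing in for the box and window routines.
    handlers = {
        "box": lambda: "box result",
        "window": lambda: "window result",
    }
    if method not in handlers:
        # Raising (rather than returning) the exception lets callers catch it.
        raise NotImplementedError(f"Method {method} not implemented.")
    return handlers[method]()

try:
    detect("knn")
except NotImplementedError as err:
    message = str(err)

print(detect("box"), "|", message)
```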

neighborhoods_box(adata, outlier_factor=3, threshold=0.2, nbin=10, **kwargs)

Identify broken neighborhoods using boxplot-based outlier detection.

This method bins the true distances and computes boxplot statistics within each bin. Links are considered broken if their embedding distance is an outlier relative to other links with similar true distances.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix with precomputed embedding and neighbor graph.

required
outlier_factor float

IQR multiplier for boxplot outlier detection. Values beyond Q1 - outlier_factor * IQR or Q3 + outlier_factor * IQR are outliers.

3
threshold float

Proportion threshold for flagging samples as having broken neighborhoods.

0.2
nbin int

Number of bins to divide the true distance range into.

10
**kwargs keyword arguments

Additional arguments passed to neighborhood_distances().

{}

Returns:

Type Description
dict

Dictionary mapping center indices to lists of their neighbor indices for samples with broken neighborhoods.
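
The binning idea described above can be sketched in a few lines. This is a minimal illustration, not the library's `identify_broken_box`: the helper name `flag_box_outliers`, the equal-width bins from `pd.cut`, and the fixed 25th/75th quartiles are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def flag_box_outliers(true_dist, embed_dist, outlier_factor=3, nbin=10):
    """Flag links whose embedding distance is a boxplot outlier within its bin.

    Hypothetical helper for illustration only.
    """
    df = pd.DataFrame({"true": true_dist, "embedding": embed_dist})
    # Split the range of true distances into nbin equal-width bins (assumption).
    df["bin"] = pd.cut(df["true"], bins=nbin, labels=False)

    # Per-bin quartiles of the embedding distance, broadcast back to each row.
    grouped = df.groupby("bin")["embedding"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    lower = q1 - outlier_factor * iqr
    upper = q3 + outlier_factor * iqr
    return ((df["embedding"] < lower) | (df["embedding"] > upper)).to_numpy()


# Toy data: embedding distance tracks true distance except for one link.
true_dist = np.linspace(0, 1, 100)
embed_dist = true_dist.copy()
embed_dist[50] = 10.0
flags = flag_box_outliers(true_dist, embed_dist)
```

Comparing each link only with links of similar true distance keeps long-range links, which naturally have larger embedding distances, from being flagged simply for being long.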

Source code in distortions/geometry/neighborhoods.py
def neighborhoods_box(adata, outlier_factor=3, threshold=0.2, nbin=10, **kwargs):
    """
    Identify broken neighborhoods using boxplot-based outlier detection.

    This method bins the true distances and computes boxplot statistics within
    each bin. Links are considered broken if their embedding distance is an
    outlier relative to other links with similar true distances.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix with precomputed embedding and neighbor graph.
    outlier_factor : float, default=3
        IQR multiplier for boxplot outlier detection. Values beyond
        Q1 - outlier_factor*IQR or Q3 + outlier_factor*IQR are outliers.
    threshold : float, default=0.2
        Proportion threshold for flagging samples as having broken neighborhoods.
    nbin : int, default=10
        Number of bins to divide the true distance range into.
    **kwargs : keyword arguments
        Additional arguments passed to neighborhood_distances().

    Returns
    -------
    dict
        Dictionary mapping center indices to lists of their neighbor indices
        for samples with broken neighborhoods.
    """
    dists = neighborhood_distances(adata, **kwargs)
    brokenness = identify_broken_box(dists, outlier_factor, nbin)
    return threshold_links(dists, brokenness, threshold)

neighborhoods_window(adata, outlier_factor=3, threshold=0.2, percentiles=[75, 25], frame=[50, 50], **kwargs)

Identify broken neighborhoods using window-based smoothing and residual analysis.

This method applies a sliding window median filter to the distance relationships and identifies outliers based on residuals from the smoothed curve. Points with large positive residuals indicate broken neighborhoods where embedding distances are much larger than expected.

Parameters:

Name Type Description Default
adata AnnData

Annotated data matrix with precomputed embedding and neighbor graph.

required
outlier_factor float

Multiplier for IQR-based outlier threshold. Residuals greater than median + outlier_factor * IQR are considered broken.

3
threshold float

Proportion threshold for flagging samples as having broken neighborhoods.

0.2
percentiles list of float

Percentiles used for IQR calculation in residual analysis.

[75, 25]
frame list of int

Window frame size [before, after] for sliding median calculation.

[50, 50]
**kwargs keyword arguments

Additional arguments passed to neighborhood_distances().

{}

Returns:

Type Description
dict

Dictionary mapping center indices to lists of their neighbor indices for samples with broken neighborhoods.
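
The smoothing-and-residual idea can be sketched as follows. This is an illustration under stated assumptions, not the library's `identify_broken_window`: the helper name `flag_window_outliers` is hypothetical, and a centered pandas rolling median of width `frame[0] + frame[1] + 1` stands in for the `[before, after]` frame.

```python
import numpy as np
import pandas as pd


def flag_window_outliers(true_dist, embed_dist, outlier_factor=3,
                         percentiles=(75, 25), frame=(50, 50)):
    """Flag links whose embedding distance sits far above a running median.

    Hypothetical helper for illustration only.
    """
    # Order links by true distance so the window slides along that axis.
    order = np.argsort(true_dist)
    sorted_embed = pd.Series(np.asarray(embed_dist, dtype=float)[order])

    # Centered rolling median (assumption: symmetric stand-in for the frame).
    window = frame[0] + frame[1] + 1
    smooth = sorted_embed.rolling(window, center=True, min_periods=1).median()
    residual = (sorted_embed - smooth).to_numpy()

    # IQR of the residuals from the requested percentiles.
    upper_pct, lower_pct = np.percentile(residual, percentiles)
    cutoff = np.median(residual) + outlier_factor * (upper_pct - lower_pct)

    flags = np.zeros(len(residual), dtype=bool)
    flags[order] = residual > cutoff  # map results back to the input order
    return flags


# Toy data: noisy linear relationship with one strongly stretched link.
rng = np.random.default_rng(0)
true_dist = rng.permutation(np.linspace(0, 1, 200))
embed_dist = true_dist + rng.normal(0, 0.05, 200)
embed_dist[120] += 5.0
flags = flag_window_outliers(true_dist, embed_dist)
```

Only positive residuals are tested against the cutoff, which matches the description above: a broken link is one whose embedding distance is much larger than expected, not smaller.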

Source code in distortions/geometry/neighborhoods.py
def neighborhoods_window(adata, outlier_factor=3, threshold=0.2, percentiles=[75, 25], frame=[50, 50], **kwargs):
    """
    Identify broken neighborhoods using window-based smoothing and residual analysis.

    This method applies a sliding window median filter to the distance relationships
    and identifies outliers based on residuals from the smoothed curve. Points with
    large positive residuals indicate broken neighborhoods where embedding distances
    are much larger than expected.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix with precomputed embedding and neighbor graph.
    outlier_factor : float, default=3
        Multiplier for IQR-based outlier threshold. Residuals greater than
        median + outlier_factor * IQR are considered broken.
    threshold : float, default=0.2
        Proportion threshold for flagging samples as having broken neighborhoods.
    percentiles : list of float, default=[75, 25]
        Percentiles used for IQR calculation in residual analysis.
    frame : list of int, default=[50, 50]
        Window frame size [before, after] for sliding median calculation.
    **kwargs : keyword arguments
        Additional arguments passed to neighborhood_distances().

    Returns
    -------
    dict
        Dictionary mapping center indices to lists of their neighbor indices
        for samples with broken neighborhoods.
    """
    dists = neighborhood_distances(adata, **kwargs)
    brokenness = identify_broken_window(dists, outlier_factor, percentiles, frame)
    return threshold_links(dists, brokenness, threshold)

threshold_links(dists, brokenness, threshold=0.2)

Flag samples with high proportions of broken neighborhood links.

This function identifies samples where the proportion of broken neighborhood links exceeds a specified threshold, indicating problematic embedding regions.

Parameters:

Name Type Description Default
dists DataFrame

DataFrame containing distance information with 'center' and 'neighbor' columns.

required
brokenness DataFrame

DataFrame with 'center' and 'brokenness' columns indicating broken links.

required
threshold float

Proportion threshold for flagging samples. Centers with more than this proportion of broken neighbors are included in the output.

0.2

Returns:

Type Description
dict

Dictionary mapping center indices (int) to lists of their neighbor indices for samples exceeding the brokenness threshold.
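
To make the proportion rule concrete, here is a small self-contained sketch on toy data. It uses a pandas `groupby` rather than the library's loop, and the helper name `flag_centers` is hypothetical; it also assumes center and neighbor labels are integer-like.

```python
import pandas as pd


def flag_centers(dists, brokenness, threshold=0.2):
    """Return {center: [neighbors]} for centers whose broken share exceeds threshold.

    Hypothetical helper for illustration only.
    """
    # The mean of a boolean column is the proportion of broken links per center.
    share = brokenness.groupby("center")["brokenness"].mean()
    flagged = share[share > threshold].index
    return {
        int(c): [int(n) for n in dists.loc[dists["center"] == c, "neighbor"]]
        for c in flagged
    }


# Two centers with four neighbors each; only center 0 has broken links.
dists = pd.DataFrame({
    "center":   [0, 0, 0, 0, 1, 1, 1, 1],
    "neighbor": [1, 2, 3, 4, 0, 2, 3, 4],
})
brokenness = dists[["center"]].assign(
    brokenness=[True, True, False, False, False, False, False, False]
)
result = flag_centers(dists, brokenness)
print(result)  # {0: [1, 2, 3, 4]}
```

Center 0 has two broken links out of four (0.5 > 0.2) and is reported with all of its neighbors; center 1 has none and is omitted.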

Source code in distortions/geometry/neighborhoods.py
def threshold_links(dists, brokenness, threshold=0.2):
    """
    Flag samples with high proportions of broken neighborhood links.

    This function identifies samples where the proportion of broken neighborhood
    links exceeds a specified threshold, indicating problematic embedding regions.

    Parameters
    ----------
    dists : pd.DataFrame
        DataFrame containing distance information with 'center' and 'neighbor' columns.
    brokenness : pd.DataFrame
        DataFrame with 'center' and 'brokenness' columns indicating broken links.
    threshold : float, default=0.2
        Proportion threshold for flagging samples. Centers with more than this
        proportion of broken neighbors are included in the output.

    Returns
    -------
    dict
        Dictionary mapping center indices (int) to lists of their neighbor indices
        for samples exceeding the brokenness threshold.
    """
    brokenness = brokenness.reset_index()
    summary_dict = {}

    # Keep each center whose share of broken links exceeds the threshold.
    for center in brokenness["center"].unique():
        subset = brokenness[brokenness["center"] == center]
        if np.mean(subset["brokenness"]) > threshold:
            summary_dict[center] = [int(z) for z in dists[dists.center == center].neighbor.values]
    return summary_dict

distortions.visualization

dplot

Bases: AnyWidget

Interactive Distortion Plot Widget

This class provides an interactive widget for visualizing distortion metrics computed on datasets, with a ggplot2-like syntax for adding graphical marks and overlaying distortion criteria. It is designed for use in Jupyter environments and leverages the anywidget and traitlets libraries for interactivity. You can pause mouseover interactivity by holding down the control key.

Parameters:

Name Type Description Default
df DataFrame

The input dataset to visualize. Must be convertible to a list of records.

required
*args tuple

Additional positional arguments passed to the parent AnyWidget.

()
**kwargs dict

Additional keyword arguments passed to the parent AnyWidget and used as visualization options.

{}

Methods:

Name Description
mapping

Specify the mapping from data columns to visual properties.

geom_ellipse

Add an ellipse layer to the plot.

geom_hair

Add a hair (small oriented lines) layer to the plot.

labs

Add labels to the plot.

geom_edge_link

Add edge link geometry to the plot.

inter_edge_link

Add interactive edge link geometry to the plot.

inter_isometry

Add interactive isometry overlays to the plot.

scale_color

Add a color scale to the plot.

scale_size

Add a size scale to the plot.

inter_boxplot

Add an interactive boxplot layer for distortion metrics, using provided distance summaries and outlier information.

save

Save the current view to SVG.

Examples:

>>> import pandas as pd
>>> df = pd.DataFrame({...})
>>> dplot(df).mapping(x='embedding_1', y='embedding_2').geom_ellipse()
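
The chained calls in the example work because every layer method records its options and returns the widget itself. The sketch below mimics that pattern with a plain Python class; `LayerStack`, `add`, and the `opacity` option are hypothetical names used only for illustration.

```python
class LayerStack:
    """Toy stand-in for the widget: each call records a layer and returns self."""

    def __init__(self):
        self.layers = []

    def add(self, layer_type, **options):
        # Assign a new list instead of appending in place; the widget source
        # uses the same `self.layers + [...]` pattern for its synced trait.
        self.layers = self.layers + [{"type": layer_type, "options": options}]
        return self


stack = LayerStack().add("geom_ellipse", opacity=0.5).add("labs", x="UMAP 1")
layer_types = [layer["type"] for layer in stack.layers]
```

Layers are stored in call order, so the sequence of method calls presumably determines how the front end processes them.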
Source code in distortions/visualization/interactive.py
class dplot(anywidget.AnyWidget):
    """
    Interactive Distortion Plot Widget

    This class provides an interactive widget for visualizing distortion metrics
    computed on datasets, with a ggplot2-like syntax for adding graphical marks
    and overlaying distortion criteria. It is designed for use in Jupyter
    environments and leverages the anywidget and traitlets libraries for
    interactivity. You can pause mouseover interactivity by holding down the
    control key.

    Parameters
    ----------
    df : pandas.DataFrame
        The input dataset to visualize. Must be convertible to a list of records.
    *args : tuple
        Additional positional arguments passed to the parent AnyWidget.
    **kwargs : dict
        Additional keyword arguments passed to the parent AnyWidget and used as
        visualization options.

    Methods
    -------
    mapping(**kwargs)
        Specify the mapping from data columns to visual properties.
    geom_ellipse(**kwargs)
        Add an ellipse layer to the plot.
    geom_hair(**kwargs)
        Add a hair (small oriented lines) layer to the plot.
    labs(**kwargs)
        Add labels to the plot.
    geom_edge_link(**kwargs)
        Add edge link geometry to the plot.
    inter_edge_link(**kwargs)
        Add interactive edge link geometry to the plot.
    inter_isometry(**kwargs)
        Add interactive isometry overlays to the plot.
    scale_color(**kwargs)
        Add a color scale to the plot.
    scale_size(**kwargs)
        Add a size scale to the plot.
    inter_boxplot(dists, **kwargs)
        Add an interactive boxplot layer for distortion metrics, using provided
        distance summaries and outlier information.
    save(filename)
        Save the current view to SVG.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({...})
    >>> dplot(df).mapping(x='embedding_1', y='embedding_2').geom_ellipse()
    """
    widget_dir = Path(__file__).parent / "widget"
    _esm = widget_dir / "render.js"
    _mapping = traitlets.Dict().tag(sync=True)
    dataset = traitlets.List().tag(sync=True)
    layers = traitlets.List().tag(sync=True)
    neighbors = traitlets.List().tag(sync=True)
    distance_summaries = traitlets.List().tag(sync=True)
    outliers = traitlets.List().tag(sync=True)
    options = traitlets.Dict().tag(sync=True)
    elem_svg = traitlets.Unicode().tag(sync=True)

    def __init__(self, df, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset = df.to_dict("records")
        self.options = kwargs

    def mapping(self, **kwargs):
        """
        Specify the mapping from data columns to visual properties.
        """
        kwargs = {"angle": "angle", "a": "s1", "b": "s0", **kwargs}
        self._mapping = kwargs
        return self

    def geom_ellipse(self, **kwargs):
        self.layers = self.layers + [{"type": "geom_ellipse", "options": kwargs}]
        return self

    def geom_hair(self, **kwargs):
        self.layers = self.layers + [{'type': 'geom_hair', 'options': kwargs}]
        return self

    def labs(self, **kwargs):
        self.layers = self.layers + [{"type": "labs", "options": kwargs}]
        return self

    def geom_edge_link(self, **kwargs):
        self.layers = self.layers + [{"type": "geom_edge_link", "options": kwargs}]
        return self

    def inter_edge_link(self, **kwargs):
        self.layers = self.layers + [{"type": "inter_edge_link", "options": kwargs}]
        return self

    def inter_isometry(self, **kwargs):
        self.layers = self.layers + [{"type": "inter_isometry", "options": kwargs}]
        return self

    def scale_color(self, **kwargs):
        self.layers = self.layers + [{"type": "scale_color", "options": kwargs}]
        return self

    def scale_size(self, **kwargs):
        self.layers = self.layers + [{"type": "scale_size", "options": kwargs}]
        return self

    def inter_boxplot(self, dists, **kwargs):
        summaries, outliers = boxplot_data(dists["true"], dists["embedding"], **kwargs)
        outliers["center"] = dists.center.values[outliers["index"].values]
        outliers["neighbor"] = dists.neighbor.values[outliers["index"].values]

        # pass the related data to the visualization
        self.layers = self.layers + [{"type": "inter_boxplot", "options": kwargs}]
        self.distance_summaries = summaries.to_dict("records")
        self.outliers = outliers.to_dict("records")
        return self

    def save(self, filename="plot.svg"):
        self.send({"type": "save"})
        # The with block closes the file on exit, so no explicit close is needed.
        with open(filename, "w") as f:
            f.write(self.elem_svg)

mapping(**kwargs)

Specify the mapping from data columns to visual properties.

Source code in distortions/visualization/interactive.py
def mapping(self, **kwargs):
    """
    Specify the mapping from data columns to visual properties.
    """
    kwargs = {"angle": "angle", "a": "s1", "b": "s0", **kwargs}
    self._mapping = kwargs
    return self
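
The dictionary literal in the source puts the defaults first and unpacks `**kwargs` last, so caller-supplied keys replace the defaults. A minimal standalone sketch of that merge, where `build_mapping` and the column name `"major"` are hypothetical:

```python
def build_mapping(**kwargs):
    # Defaults come first, so any key passed by the caller overrides them.
    return {"angle": "angle", "a": "s1", "b": "s0", **kwargs}


m = build_mapping(x="embedding_1", y="embedding_2", a="major")
# "a" now points at "major"; "angle" and "b" keep their defaults.
```

In other words, a call that passes only `x` and `y` still gets `angle`, `a`, and `b` mapped to the `angle`, `s1`, and `s0` columns.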

scanpy_umap(adata, max_cells=200, n_neighbors=10, n_pcs=40)

Runs UMAP visualization on an AnnData object with basic preprocessing.

This wrapper function filters genes by minimum count, applies log transformation, selects highly variable genes, computes neighbors in PCA space, and runs UMAP.

Parameters:

Name Type Description Default
adata AnnData

AnnData experiment object containing the data to filter, transform, and apply UMAP to.

required
max_cells int, optional (default: 200)

Maximum number of cells to use for visualization.

200
n_neighbors int, optional (default: 10)

Number of neighbors to use for constructing the neighborhood graph.

10
n_pcs int, optional (default: 40)

Number of principal components to use for neighborhood graph construction.

40

Returns:

Name Type Description
adata AnnData

The AnnData object after preprocessing and UMAP computation.

Notes

The input is first subset to its first max_cells cells, so work with the returned AnnData object rather than relying on the input being updated.

Examples:

>>> import scanpy as sc
>>> from distortions.visualization import scanpy_umap
>>> adata = sc.datasets.pbmc3k()
>>> adata_umap = scanpy_umap(adata, max_cells=100, n_neighbors=15, n_pcs=30)
>>> sc.pl.umap(adata_umap)
Source code in distortions/visualization/umap.py
def scanpy_umap(adata, max_cells=200, n_neighbors=10, n_pcs=40):
    """
    Runs UMAP visualization on an AnnData object with basic preprocessing.

    This wrapper function filters genes by minimum count, applies log
    transformation, selects highly variable genes, computes neighbors in PCA
    space, and runs UMAP.

    Parameters
    ----------
    adata : anndata.AnnData
        AnnData experiment object containing the data to filter, transform, and
        apply UMAP to.
    max_cells : int, optional (default: 200)
        Maximum number of cells to use for visualization.
    n_neighbors : int, optional (default: 10)
        Number of neighbors to use for constructing the neighborhood graph.
    n_pcs : int, optional (default: 40)
        Number of principal components to use for neighborhood graph construction.

    Returns
    -------
    adata : anndata.AnnData
        The AnnData object after preprocessing and UMAP computation.

    Notes
    -----
    The input is first subset to its first max_cells cells, so work with the
    returned AnnData object rather than relying on the input being updated.

    Examples
    --------
    >>> import scanpy as sc
    >>> from distortions.visualization import scanpy_umap
    >>> adata = sc.datasets.pbmc3k()
    >>> adata_umap = scanpy_umap(adata, max_cells=100, n_neighbors=15, n_pcs=30)
    >>> sc.pl.umap(adata_umap)
    """
    adata = adata[:max_cells, :]

    # Preprocess the dataset
    sc.pp.filter_genes(adata, min_counts=1)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, min_mean=0.5, min_disp=0.5)
    adata = adata[:, adata.var.highly_variable]

    # Run UMAP
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    sc.tl.umap(adata)
    return adata