Function Reference

API Reference

`distortions.geometry`

`Geometry`

Bases: object

The Geometry class stores the data, distance, affinity and laplacian matrices used by the various embedding methods and is the primary object passed to embedding functions.

The Geometry class contains functions to compute the aforementioned matrices and allows for re-computation whenever necessary.

Parameters:

Name	Type	Description	Default
`adjacency_method`	`string {'auto', 'brute', 'pyflann', 'cyflann'}`	method for computing pairwise radius neighbors graph.	`'auto'`
`adjacency_kwds`	`dict`	dictionary containing keyword arguments for adjacency matrix. see distance.py documentation for arguments for each method. If new kwargs are passed to compute_adjacency_matrix then this dictionary will be updated.	`None`
`affinity_method`	`string {'auto', 'gaussian'}`	method of computing affinity matrix	`'auto'`
`affinity_kwds`	`dict`	dictionary containing keyword arguments for affinity matrix. see affinity.py documentation for arguments for each method. If new kwargs are passed to compute_affinity_matrix then this dictionary will be updated.	`None`
`laplacian_method`	`string`	type of laplacian to be computed. Possibilities are {'symmetricnormalized', 'geometric', 'renormalized', 'unnormalized', 'randomwalk'} see laplacian.py for more information.	`'auto'`
`laplacian_kwds`	`dict`	dictionary containing keyword arguments for Laplacian matrix. see laplacian.py documentation for arguments for each method. If new kwargs are passed to compute_laplacian_matrix then this dictionary will be updated.	`None`
`**kwargs`		additional arguments will be parsed and used to override values in the above dictionaries. For example: - `affinity_radius` will override `affinity_kwds['radius']` - `adjacency_n_neighbors` will override `adjacency_kwds['n_neighbors']` etc.	`{}`
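The prefix-based override described for `**kwargs` can be sketched in isolation. The helper name `route_kwargs` is hypothetical and not part of the library; it only mirrors the idea of splitting each keyword at its first underscore and writing the remainder into the matching dictionary. The empty-name guard is an addition made for this sketch.

```python
def route_kwargs(adjacency_kwds=None, affinity_kwds=None,
                 laplacian_kwds=None, **kwargs):
    """Hypothetical helper: route prefixed keywords into per-stage dicts."""
    dicts = {
        "adjacency": dict(adjacency_kwds or {}),
        "affinity": dict(affinity_kwds or {}),
        "laplacian": dict(laplacian_kwds or {}),
    }
    for key, val in kwargs.items():
        # "affinity_radius" -> prefix "affinity", name "radius"
        prefix, _, name = key.partition("_")
        if prefix not in dicts or not name:
            raise ValueError("key `{0}` not valid".format(key))
        dicts[prefix][name] = val
    return dicts


routed = route_kwargs(affinity_kwds={"radius": 1.0},
                      affinity_radius=0.5,
                      adjacency_n_neighbors=10)
print(routed["affinity"])   # {'radius': 0.5}
print(routed["adjacency"])  # {'n_neighbors': 10}
```

An explicit prefixed keyword wins over the value already present in the corresponding dictionary, and an unknown prefix raises `ValueError`.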

Source code in distortions/geometry/geometry.py

class Geometry(object):
    """
    The Geometry class stores the data, distance, affinity and laplacian
    matrices used by the various embedding methods and is the primary
    object passed to embedding functions.

    The Geometry class contains functions to compute the aforementioned
    matrices and allows for re-computation whenever necessary.

    Parameters
    ----------
    adjacency_method : string {'auto', 'brute', 'pyflann', 'cyflann'}
        method for computing pairwise radius neighbors graph.
    adjacency_kwds : dict
        dictionary containing keyword arguments for adjacency matrix.
        see distance.py documentation for arguments for each method.
        If new kwargs are passed to compute_adjacency_matrix then this
        dictionary will be updated.
    affinity_method : string {'auto', 'gaussian'}
        method of computing affinity matrix
    affinity_kwds : dict
        dictionary containing keyword arguments for affinity matrix.
        see affinity.py documentation for arguments for each method.
        If new kwargs are passed to compute_affinity_matrix then this
        dictionary will be updated.
    laplacian_method : string
        type of laplacian to be computed. Possibilities are
        {'symmetricnormalized', 'geometric', 'renormalized',
        'unnormalized', 'randomwalk'} see laplacian.py for more information.
    laplacian_kwds : dict
        dictionary containing keyword arguments for Laplacian matrix.
        see laplacian.py documentation for arguments for each method.
        If new kwargs are passed to compute_laplacian_matrix then this
        dictionary will be updated.
    **kwargs :
        additional arguments will be parsed and used to override values in
        the above dictionaries. For example:
        - `affinity_radius` will override `affinity_kwds['radius']`
        - `adjacency_n_neighbors` will override `adjacency_kwds['n_neighbors']`
        etc.
    """
    def __init__(self, adjacency_method='auto', adjacency_kwds=None,
                 affinity_method='auto', affinity_kwds=None,
                 laplacian_method='auto',laplacian_kwds=None, **kwargs):
        self.adjacency_method = adjacency_method
        self.adjacency_kwds = dict(**(adjacency_kwds or {}))
        self.affinity_method = affinity_method
        self.affinity_kwds = dict(**(affinity_kwds or {}))
        self.laplacian_method = laplacian_method
        self.laplacian_kwds = dict(**(laplacian_kwds or {}))

        # map extra keywords: e.g. affinity_radius -> affinity_kwds['radius']
        dicts = dict(adjacency=self.adjacency_kwds,
                     affinity=self.affinity_kwds,
                     laplacian=self.laplacian_kwds)
        for key, val in kwargs.items():
            keysplit = key.split('_')
            if keysplit[0] not in dicts:
                raise ValueError('key `{0}` not valid'.format(key))
            dicts[keysplit[0]]['_'.join(keysplit[1:])] = val

        self.X = None
        self.adjacency_matrix = None
        self.affinity_matrix = None
        self.laplacian_matrix = None
        self.laplacian_symmetric = None
        self.laplacian_weights = None

    def set_radius(self, radius, override=True, X=None, n_components=2):
        """Set the radius for the adjacency and affinity computation

        By default, this will override keyword arguments provided on
        initialization.

        Parameters
        ----------
        radius : float
            radius to set for adjacency and affinity.
        override : bool (default: True)
            if False, then only set radius if not already defined in
            `adjacency_kwds` and `affinity_kwds`.
        X : ndarray or sparse (optional)
            if provided, estimate a suitable radius from this data.
        n_components : int (default=2)
            the number of components to use when estimating the radius
        """
        if radius < 0:
            raise ValueError("radius must be non-negative")

        if override or ('radius' not in self.adjacency_kwds and
                        'n_neighbors' not in self.adjacency_kwds):
            self.adjacency_kwds['radius'] = radius

        if override or ('radius' not in self.affinity_kwds):
            self.affinity_kwds['radius'] = radius

    def set_matrix(self, X, input_type):
        """
        Set the matrix corresponding to the given input type.

        Parameters
        ----------
        X : array-like
            Input matrix to set.
        input_type : str
            Type of matrix to set. Options: {'data', 'adjacency', 'affinity'}
        """
        if input_type == 'data':
            self.set_data_matrix(X)
        elif input_type == 'adjacency':
            self.set_adjacency_matrix(X)
        elif input_type == 'affinity':
            self.set_affinity_matrix(X)
        else:
            raise ValueError("Unrecognized input_type: {0}".format(input_type))


    def compute_adjacency_matrix(self, copy=False, **kwargs):
        """
        This function will compute the adjacency matrix.
        In order to acquire the existing adjacency matrix use
        self.adjacency_matrix as compute_adjacency_matrix() will re-compute
        the adjacency matrix.

        Parameters
        ----------
        copy : boolean
            whether to return a copied version of the adjacency matrix
        **kwargs :
            see distance.py documentation for arguments for each method.

        Returns
        -------
        self.adjacency_matrix : sparse matrix (N_obs, N_obs)
            Entries not explicitly stored (implicit zeros) should be
            considered not connected.
        """
        if self.X is None:
            raise ValueError(distance_error_msg)

        kwds = self.adjacency_kwds.copy()
        kwds.update(kwargs)
        self.adjacency_matrix = compute_adjacency_matrix(self.X,
                                                         self.adjacency_method,
                                                         **kwds)
        if copy:
            return self.adjacency_matrix.copy()
        else:
            return self.adjacency_matrix

    def compute_affinity_matrix(self, copy=False, **kwargs):
        """
        This function will compute the affinity matrix. In order to
        acquire the existing affinity matrix use self.affinity_matrix as
        compute_affinity_matrix() will re-compute the affinity matrix.

        Parameters
        ----------
        copy : boolean
            whether to return a copied version of the affinity matrix
        **kwargs :
            see affinity.py documentation for arguments for each method.

        Returns
        -------
        self.affinity_matrix : sparse matrix (N_obs, N_obs)
            contains the pairwise affinity values using the Gaussian kernel
            and bandwidth equal to the affinity_radius
        """
        if self.adjacency_matrix is None:
            self.compute_adjacency_matrix()

        kwds = self.affinity_kwds.copy()
        kwds.update(kwargs)
        self.affinity_matrix = compute_affinity_matrix(self.adjacency_matrix,
                                                       self.affinity_method,
                                                       **kwds)
        if copy:
            return self.affinity_matrix.copy()
        else:
            return self.affinity_matrix

    def compute_laplacian_matrix(self, copy=True, return_lapsym=False, **kwargs):
        """
        Note: this function will compute the laplacian matrix. In order to acquire
            the existing laplacian matrix use self.laplacian_matrix as
            compute_laplacian_matrix() will re-compute the laplacian matrix.

        Parameters
        ----------
        copy : boolean
            whether to return a copied version of self.laplacian_matrix
        return_lapsym : boolean
            if True returns additionally the symmetrized version of
            the requested laplacian and the re-normalization weights.
        **kwargs :
            see laplacian.py documentation for arguments for each method.

        Returns
        -------
        self.laplacian_matrix : sparse matrix (N_obs, N_obs).
            The requested laplacian.
        self.laplacian_symmetric : sparse matrix (N_obs, N_obs)
            The symmetric laplacian.
        self.laplacian_weights : ndarray (N_obs,)
            The renormalization weights used to make
            laplacian_matrix from laplacian_symmetric
        """
        if self.affinity_matrix is None:
            self.compute_affinity_matrix()

        kwds = self.laplacian_kwds.copy()
        kwds.update(kwargs)
        kwds['full_output'] = return_lapsym
        result = compute_laplacian_matrix(self.affinity_matrix,
                                          self.laplacian_method,
                                          **kwds)
        if return_lapsym:
            (self.laplacian_matrix,
             self.laplacian_symmetric,
             self.laplacian_weights) = result
        else:
            self.laplacian_matrix = result

        if copy:
            return self.laplacian_matrix.copy()
        else:
            return self.laplacian_matrix

    def set_data_matrix(self, X):
        """
        Set the data matrix.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            The original data set to input.
        """
        #X = check_array(X, accept_sparse=sparse_formats)
        self.X = X

    def set_adjacency_matrix(self, adjacency_mat):
        """
        Set the adjacency matrix.

        Parameters
        ----------
        adjacency_mat : sparse matrix, shape (n_samples, n_samples)
            The adjacency matrix to input.
        """
        #adjacency_mat = check_array(adjacency_mat, accept_sparse=sparse_formats)
        if adjacency_mat.shape[0] != adjacency_mat.shape[1]:
            raise ValueError("adjacency matrix is not square")
        self.adjacency_matrix = adjacency_mat

    def set_affinity_matrix(self, affinity_mat):
        """
        Set the affinity matrix.

        Parameters
        ----------
        affinity_mat : sparse matrix (N_obs, N_obs).
            The affinity matrix to input.
        """
        #affinity_mat = check_array(affinity_mat, accept_sparse=sparse_formats)
        if affinity_mat.shape[0] != affinity_mat.shape[1]:
            raise ValueError("affinity matrix is not square")
        self.affinity_matrix = affinity_mat

    def set_laplacian_matrix(self, laplacian_mat):
        """
        Set the Laplacian matrix.

        Parameters
        ----------
        laplacian_mat : sparse matrix (N_obs, N_obs).
            The Laplacian matrix to input.
        """
        #laplacian_mat = check_array(laplacian_mat, accept_sparse = sparse_formats)
        if laplacian_mat.shape[0] != laplacian_mat.shape[1]:
            raise ValueError("Laplacian matrix is not square")
        self.laplacian_matrix = laplacian_mat

    def delete_data_matrix(self):
        """Delete the data matrix from the Geometry object."""
        self.X = None

    def delete_adjacency_matrix(self):
        """Delete the adjacency matrix from the Geometry object."""
        self.adjacency_matrix = None

    def delete_affinity_matrix(self):
        """Delete the affinity matrix from the Geometry object."""
        self.affinity_matrix = None

    def delete_laplacian_matrix(self):
        """Delete the Laplacian matrix from the Geometry object."""
        self.laplacian_matrix = None

`compute_adjacency_matrix(copy=False, **kwargs)`

This function will compute the adjacency matrix. In order to acquire the existing adjacency matrix use self.adjacency_matrix as compute_adjacency_matrix() will re-compute the adjacency matrix.

Parameters:

Name	Type	Description	Default
`copy`	`boolean`	whether to return a copied version of the adjacency matrix	`False`
`**kwargs`		see distance.py documentation for arguments for each method.	`{}`

Returns:

Type	Description
`self.adjacency_matrix : sparse matrix (N_obs, N_obs)`	Entries not explicitly stored (implicit zeros) should be considered not connected.
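The distinction between stored and implicit entries in a sparse neighbor graph can be illustrated with a small radius-neighbors matrix. This is a sketch built directly with NumPy and SciPy, not the library's own routine; the helper `is_connected` and the toy data are assumptions made for illustration.

```python
import numpy as np
from scipy import sparse

# Four points on a line; two points are neighbors if within `radius`.
X = np.array([[0.0], [1.0], [2.5], [10.0]])
radius = 2.0

# Dense pairwise Euclidean distances (fine for a toy example).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Store only the pairs within the radius; everything else stays implicit.
rows, cols = np.nonzero(D <= radius)
A = sparse.csr_matrix((D[rows, cols], (rows, cols)), shape=D.shape)


def is_connected(A, i, j):
    """True if entry (i, j) is explicitly stored, whatever its value."""
    start, stop = A.indptr[i], A.indptr[i + 1]
    return j in A.indices[start:stop]


print(is_connected(A, 0, 1))  # stored distance 1.0
print(is_connected(A, 0, 3))  # not stored: not connected
print(is_connected(A, 0, 0))  # stored with value 0.0 (self-distance)
```

Reading `A[0, 3]` and `A[0, 0]` both gives `0.0`, so connectivity has to be read from the sparsity structure rather than from the values.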

Source code in distortions/geometry/geometry.py

def compute_adjacency_matrix(self, copy=False, **kwargs):
    """
    This function will compute the adjacency matrix.
    In order to acquire the existing adjacency matrix use
    self.adjacency_matrix as compute_adjacency_matrix() will re-compute
    the adjacency matrix.

    Parameters
    ----------
    copy : boolean
        whether to return a copied version of the adjacency matrix
    **kwargs :
        see distance.py documentation for arguments for each method.

    Returns
    -------
    self.adjacency_matrix : sparse matrix (N_obs, N_obs)
        Entries not explicitly stored (implicit zeros) should be
        considered not connected.
    """
    if self.X is None:
        raise ValueError(distance_error_msg)

    kwds = self.adjacency_kwds.copy()
    kwds.update(kwargs)
    self.adjacency_matrix = compute_adjacency_matrix(self.X,
                                                     self.adjacency_method,
                                                     **kwds)
    if copy:
        return self.adjacency_matrix.copy()
    else:
        return self.adjacency_matrix

`compute_affinity_matrix(copy=False, **kwargs)`

This function will compute the affinity matrix. In order to acquire the existing affinity matrix use self.affinity_matrix as compute_affinity_matrix() will re-compute the affinity matrix.

Parameters:

Name	Type	Description	Default
`copy`	`boolean`	whether to return a copied version of the affinity matrix	`False`
`**kwargs`		see affinity.py documentation for arguments for each method.	`{}`

Returns:

Type	Description
`self.affinity_matrix : sparse matrix (N_obs, N_obs)`	contains the pairwise affinity values using the Gaussian kernel and bandwidth equal to the affinity_radius
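A Gaussian-kernel affinity can be sketched as a transformation of the stored distances only. The function name `gaussian_affinity` and the scaling `exp(-d^2 / radius^2)` are assumptions for this illustration; the library's affinity.py may use a different bandwidth convention.

```python
import numpy as np
from scipy import sparse


def gaussian_affinity(distances, radius):
    """Sketch: apply exp(-d^2 / radius^2) to the stored entries only."""
    A = sparse.csr_matrix(distances, copy=True)
    A.data = np.exp(-(A.data ** 2) / (radius ** 2))
    return A


# Two points at distance 2.0, with the zero self-distances stored explicitly.
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 1])
vals = np.array([0.0, 2.0, 2.0, 0.0])
D = sparse.csr_matrix((vals, (rows, cols)), shape=(2, 2))

W = gaussian_affinity(D, radius=2.0)
print(W.toarray())
```

Stored zero distances map to affinity 1, while pairs that were never stored remain absent from the affinity matrix.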

Source code in distortions/geometry/geometry.py

def compute_affinity_matrix(self, copy=False, **kwargs):
    """
    This function will compute the affinity matrix. In order to
    acquire the existing affinity matrix use self.affinity_matrix as
    compute_affinity_matrix() will re-compute the affinity matrix.

    Parameters
    ----------
    copy : boolean
        whether to return a copied version of the affinity matrix
    **kwargs :
        see affinity.py documentation for arguments for each method.

    Returns
    -------
    self.affinity_matrix : sparse matrix (N_obs, N_obs)
        contains the pairwise affinity values using the Gaussian kernel
        and bandwidth equal to the affinity_radius
    """
    if self.adjacency_matrix is None:
        self.compute_adjacency_matrix()

    kwds = self.affinity_kwds.copy()
    kwds.update(kwargs)
    self.affinity_matrix = compute_affinity_matrix(self.adjacency_matrix,
                                                   self.affinity_method,
                                                   **kwds)
    if copy:
        return self.affinity_matrix.copy()
    else:
        return self.affinity_matrix

`compute_laplacian_matrix(copy=True, return_lapsym=False, **kwargs)`

Note: this function will compute the laplacian matrix. In order to acquire the existing laplacian matrix use self.laplacian_matrix as compute_laplacian_matrix() will re-compute the laplacian matrix.

Parameters:

Name	Type	Description	Default
`copy`	`boolean`	whether to return a copied version of self.laplacian_matrix	`True`
`return_lapsym`	`boolean`	if True returns additionally the symmetrized version of the requested laplacian and the re-normalization weights.	`False`
`**kwargs`		see laplacian.py documentation for arguments for each method.	`{}`

Returns:

Type	Description
`self.laplacian_matrix : sparse matrix (N_obs, N_obs).`	The requested laplacian.
`self.laplacian_symmetric : sparse matrix (N_obs, N_obs)`	The symmetric laplacian.
`self.laplacian_weights : ndarray (N_obs,)`	The renormalization weights used to make laplacian_matrix from laplacian_symmetric
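The relationship between a symmetric Laplacian, a set of renormalization weights, and a non-symmetric Laplacian can be illustrated with textbook definitions. This is a dense NumPy sketch under stated assumptions: the function name `laplacians` is hypothetical, and the sign and scaling conventions in laplacian.py may differ from the ones used here.

```python
import numpy as np


def laplacians(W):
    """Sketch: textbook graph Laplacians from a dense symmetric affinity."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)
    identity = np.eye(len(W))
    d_inv_sqrt = 1.0 / np.sqrt(degree)

    L_unnormalized = np.diag(degree) - W                         # D - W
    L_symmetric = identity - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    L_randomwalk = identity - W / degree[:, None]                # I - D^-1 W
    return L_unnormalized, L_symmetric, L_randomwalk, degree


W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.2],
              [0.5, 0.2, 0.0]])
L_un, L_sym, L_rw, degree = laplacians(W)

# The random-walk Laplacian is a diagonal rescaling of the symmetric one.
rescaled = L_sym * np.sqrt(degree)[None, :] / np.sqrt(degree)[:, None]
print(np.allclose(rescaled, L_rw))
```

In this sketch the degree vector plays the role of the weights that convert the symmetric form into the non-symmetric one.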

Source code in distortions/geometry/geometry.py

def compute_laplacian_matrix(self, copy=True, return_lapsym=False, **kwargs):
    """
    Note: this function will compute the laplacian matrix. In order to acquire
        the existing laplacian matrix use self.laplacian_matrix as
        compute_laplacian_matrix() will re-compute the laplacian matrix.

    Parameters
    ----------
    copy : boolean
        whether to return a copied version of self.laplacian_matrix
    return_lapsym : boolean
        if True returns additionally the symmetrized version of
        the requested laplacian and the re-normalization weights.
    **kwargs :
        see laplacian.py documentation for arguments for each method.

    Returns
    -------
    self.laplacian_matrix : sparse matrix (N_obs, N_obs).
        The requested laplacian.
    self.laplacian_symmetric : sparse matrix (N_obs, N_obs)
        The symmetric laplacian.
    self.laplacian_weights : ndarray (N_obs,)
        The renormalization weights used to make
        laplacian_matrix from laplacian_symmetric
    """
    if self.affinity_matrix is None:
        self.compute_affinity_matrix()

    kwds = self.laplacian_kwds.copy()
    kwds.update(kwargs)
    kwds['full_output'] = return_lapsym
    result = compute_laplacian_matrix(self.affinity_matrix,
                                      self.laplacian_method,
                                      **kwds)
    if return_lapsym:
        (self.laplacian_matrix,
         self.laplacian_symmetric,
         self.laplacian_weights) = result
    else:
        self.laplacian_matrix = result

    if copy:
        return self.laplacian_matrix.copy()
    else:
        return self.laplacian_matrix

`delete_adjacency_matrix()`

Delete the adjacency matrix from the Geometry object.

Source code in distortions/geometry/geometry.py

def delete_adjacency_matrix(self):
    """Delete the adjacency matrix from the Geometry object."""
    self.adjacency_matrix = None

`delete_affinity_matrix()`

Delete the affinity matrix from the Geometry object.

Source code in distortions/geometry/geometry.py

def delete_affinity_matrix(self):
    """Delete the affinity matrix from the Geometry object."""
    self.affinity_matrix = None

`delete_data_matrix()`

Delete the data matrix from the Geometry object.

Source code in distortions/geometry/geometry.py

def delete_data_matrix(self):
    """Delete the data matrix from the Geometry object."""
    self.X = None

`delete_laplacian_matrix()`

Delete the Laplacian matrix from the Geometry object.

Source code in distortions/geometry/geometry.py

def delete_laplacian_matrix(self):
    """Delete the Laplacian matrix from the Geometry object."""
    self.laplacian_matrix = None

`set_adjacency_matrix(adjacency_mat)`

Set the adjacency matrix.

Parameters:

Name	Type	Description	Default
`adjacency_mat`	`sparse matrix, shape (n_samples, n_samples)`	The adjacency matrix to input.	required

Source code in distortions/geometry/geometry.py

def set_adjacency_matrix(self, adjacency_mat):
    """
    Set the adjacency matrix.

    Parameters
    ----------
    adjacency_mat : sparse matrix, shape (n_samples, n_samples)
        The adjacency matrix to input.
    """
    #adjacency_mat = check_array(adjacency_mat, accept_sparse=sparse_formats)
    if adjacency_mat.shape[0] != adjacency_mat.shape[1]:
        raise ValueError("adjacency matrix is not square")
    self.adjacency_matrix = adjacency_mat

`set_affinity_matrix(affinity_mat)`

Set the affinity matrix.

Parameters:

Name	Type	Description	Default
`affinity_mat`	`sparse matrix (N_obs, N_obs).`	The affinity matrix to input.	required

Source code in distortions/geometry/geometry.py

def set_affinity_matrix(self, affinity_mat):
    """
    Set the affinity matrix.

    Parameters
    ----------
    affinity_mat : sparse matrix (N_obs, N_obs).
        The affinity matrix to input.
    """
    #affinity_mat = check_array(affinity_mat, accept_sparse=sparse_formats)
    if affinity_mat.shape[0] != affinity_mat.shape[1]:
        raise ValueError("affinity matrix is not square")
    self.affinity_matrix = affinity_mat

`set_data_matrix(X)`

Set the data matrix.

Parameters:

Name	Type	Description	Default
`X`	`array-like, shape (n_samples, n_features)`	The original data set to input.	required

Source code in distortions/geometry/geometry.py

def set_data_matrix(self, X):
    """
    Set the data matrix.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The original data set to input.
    """
    #X = check_array(X, accept_sparse=sparse_formats)
    self.X = X

`set_laplacian_matrix(laplacian_mat)`

Set the Laplacian matrix.

Parameters:

Name	Type	Description	Default
`laplacian_mat`	`sparse matrix (N_obs, N_obs).`	The Laplacian matrix to input.	required

Source code in distortions/geometry/geometry.py

def set_laplacian_matrix(self, laplacian_mat):
    """
    Set the Laplacian matrix.

    Parameters
    ----------
    laplacian_mat : sparse matrix (N_obs, N_obs).
        The Laplacian matrix to input.
    """
    #laplacian_mat = check_array(laplacian_mat, accept_sparse = sparse_formats)
    if laplacian_mat.shape[0] != laplacian_mat.shape[1]:
        raise ValueError("Laplacian matrix is not square")
    self.laplacian_matrix = laplacian_mat

`set_matrix(X, input_type)`

Set the matrix corresponding to the given input type.

Parameters:

Name	Type	Description	Default
`X`	`array-like`	Input matrix to set.	required
`input_type`	`str`	Type of matrix to set. Options: {'data', 'adjacency', 'affinity'}	required

Source code in distortions/geometry/geometry.py

def set_matrix(self, X, input_type):
    """
    Set the matrix corresponding to the given input type.

    Parameters
    ----------
    X : array-like
        Input matrix to set.
    input_type : str
        Type of matrix to set. Options: {'data', 'adjacency', 'affinity'}
    """
    if input_type == 'data':
        self.set_data_matrix(X)
    elif input_type == 'adjacency':
        self.set_adjacency_matrix(X)
    elif input_type == 'affinity':
        self.set_affinity_matrix(X)
    else:
        raise ValueError("Unrecognized input_type: {0}".format(input_type))

`set_radius(radius, override=True, X=None, n_components=2)`

Set the radius for the adjacency and affinity computation

By default, this will override keyword arguments provided on initialization.

Parameters:

Name	Type	Description	Default
`radius`	`float`	radius to set for adjacency and affinity.	required
`override`	`bool`	if False, then only set radius if not already defined in `adjacency_kwds` and `affinity_kwds`.	`True`
`X`	`ndarray or sparse, optional`	if provided, estimate a suitable radius from this data.	`None`
`n_components`	`int`	the number of components to use when estimating the radius	`2`
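The effect of `override` can be sketched on plain dictionaries. The function `apply_radius` below is a hypothetical stand-in that follows the conditions in the method body listed on this page; it is not part of the library.

```python
def apply_radius(adjacency_kwds, affinity_kwds, radius, override=True):
    """Hypothetical stand-in for the radius-setting logic, on plain dicts."""
    if radius < 0:
        raise ValueError("radius must be non-negative")

    # Adjacency: without override, leave an existing radius or
    # n_neighbors setting untouched.
    if override or ("radius" not in adjacency_kwds
                    and "n_neighbors" not in adjacency_kwds):
        adjacency_kwds["radius"] = radius

    # Affinity: without override, only fill in a missing radius.
    if override or "radius" not in affinity_kwds:
        affinity_kwds["radius"] = radius
    return adjacency_kwds, affinity_kwds


adj, aff = apply_radius({"n_neighbors": 5}, {}, radius=1.5, override=False)
print(adj, aff)    # {'n_neighbors': 5} {'radius': 1.5}

adj2, aff2 = apply_radius({"n_neighbors": 5}, {"radius": 0.1}, radius=1.5)
print(adj2, aff2)  # {'n_neighbors': 5, 'radius': 1.5} {'radius': 1.5}
```

In the method source listed on this page, `X` and `n_components` are accepted but not referenced in the body, so the sketch omits them.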

Source code in distortions/geometry/geometry.py

def set_radius(self, radius, override=True, X=None, n_components=2):
    """Set the radius for the adjacency and affinity computation

    By default, this will override keyword arguments provided on
    initialization.

    Parameters
    ----------
    radius : float
        radius to set for adjacency and affinity.
    override : bool (default: True)
        if False, then only set radius if not already defined in
        `adjacency_kwds` and `affinity_kwds`.
    X : ndarray or sparse (optional)
        if provided, estimate a suitable radius from this data.
    n_components : int (default=2)
        the number of components to use when estimating the radius
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")

    if override or ('radius' not in self.adjacency_kwds and
                    'n_neighbors' not in self.adjacency_kwds):
        self.adjacency_kwds['radius'] = radius

    if override or ('radius' not in self.affinity_kwds):
        self.affinity_kwds['radius'] = radius

`bind_metric(embedding, Hvv, Hs)`

Combine embedding coordinates with local Riemannian metric information.

Parameters:

Name	Type	Description	Default
`embedding`	`ndarray, shape (n_samples, n_embedding_dims)`	The low-dimensional embedding of the data. This should be the same array as the `embedding` argument passed to `local_distortions`.	required
`Hvv`	`ndarray, shape (n_samples, n_embedding_dims, n_embedding_dims)`	The singular vectors of the dual Riemannian metric tensor for each sample, as returned by `local_distortions`.	required
`Hs`	`ndarray, shape (n_samples, n_embedding_dims)`	The singular values of the dual Riemannian metric tensor for each sample, as returned by `local_distortions`.	required

Returns:

Name	Type	Description
`combined`	`DataFrame`	A DataFrame containing the embedding coordinates, the singular vectors and singular values of the local dual Riemannian metric for each sample, and an additional column "angle" computed from the first two singular vector components.

Notes

This function is intended to facilitate analysis and visualization by merging the embedding and local metric information into a single tabular structure.
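The "angle" column is an arctangent of a ratio of two singular vector components, converted to degrees. The sketch below uses made-up component arrays `x` and `y` (assumptions, not library output) to show what that formula yields and how it differs from `np.arctan2`.

```python
import numpy as np

# Hypothetical components of one singular vector per sample.
x = np.array([1.0, 1.0, -1.0])
y = np.array([0.0, 1.0, 1.0])

# Same form as the "angle" column: arctan of the ratio, in degrees.
angle = np.arctan(y / x) * (180 / np.pi)

# arctan2 keeps the quadrant, so (-1, 1) maps to 135 instead of -45.
angle_full = np.degrees(np.arctan2(y, x))

print(angle)
print(angle_full)
```

Because a singular vector is only defined up to sign, an orientation in the open interval from -90 to 90 degrees is enough to describe its axis, which is what the plain arctangent provides.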

Source code in distortions/geometry/rmetric.py

def bind_metric(embedding, Hvv, Hs):
    """
    Combine embedding coordinates with local Riemannian metric information.

    Parameters
    ----------
    embedding : np.ndarray, shape (n_samples, n_embedding_dims)
        The low-dimensional embedding of the data. This should be the same array
        as the `embedding` argument passed to `local_distortions`.
    Hvv : np.ndarray, shape (n_samples, n_embedding_dims, n_embedding_dims)
        The singular vectors of the dual Riemannian metric tensor for each sample,
        as returned by `local_distortions`.
    Hs : np.ndarray, shape (n_samples, n_embedding_dims)
        The singular values of the dual Riemannian metric tensor for each sample,
        as returned by `local_distortions`.

    Returns
    -------
    combined : pd.DataFrame
        A DataFrame containing the embedding coordinates, the singular vectors and
        singular values of the local dual Riemannian metric for each sample, and
        an additional column "angle" computed from the first two singular vector
        components.

    Notes
    -----
    This function is intended to facilitate analysis and visualization by merging
    the embedding and local metric information into a single tabular structure.
    """
    K = embedding.shape[1]
    Hvv_df = pd.concat([arrays_to_df(Hvv), arrays_to_df(Hs)], axis=1)
    embedding_df = pd.DataFrame(embedding, columns=[f"embedding_{i}" for i in range(K)])
    embedding_df = embedding_df.reset_index(drop=True)
    Hvv_df = Hvv_df.reset_index(drop=True)

    # merge the embedding and metric data
    combined = pd.concat([embedding_df, Hvv_df], axis=1)
    metric_columns = sum([[f"x{i}", f"y{i}"] for i in range(K)], []) + [f"s{i}" for i in range(K)]
    combined.columns = list(embedding_df.columns) + metric_columns
    combined["angle"] = np.arctan(combined.y0 / combined.x0) * (180 / np.pi)
    return combined

`boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs)`

Compute boxplot statistics and identify outliers within distance bins.

This function divides the x-values (typically true distances) into bins and computes boxplot statistics for the y-values (typically embedding distances) within each bin. It identifies outliers using the IQR method.

Parameters:

Name	Type	Description	Default
`x`	`array - like`	Input values used for binning (typically true/original distances).	required
`y`	`array - like`	Target values for which to compute statistics (typically embedding distances).	required
`nbin`	`int`	Number of bins to divide the x-value range into.	`10`
`outlier_iqr`	`float`	IQR multiplier for outlier detection. Values beyond `Q1 - outlier_iqr * IQR` or `Q3 + outlier_iqr * IQR` within each bin are considered outliers.	`3`
`**kwargs`	`keyword arguments`	Additional keyword arguments (currently unused).	`{}`

Returns:

Name	Type	Description
`summaries`	`DataFrame`	DataFrame with boxplot statistics for each bin, containing columns: 'bin_id' (bin identifier); 'q1', 'q2', 'q3' (quartile values); 'min', 'max' (minimum and maximum values); 'iqr' (interquartile range); 'lower', 'upper' (outlier detection bounds); 'bin' (string representation of bin range).
`outliers`	`DataFrame`	DataFrame with outlier information, containing columns: 'index' (original index of outlier point); 'bin_id' (which bin the outlier belongs to); 'bin' (string representation of bin range); 'value' (the outlier y-value).
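The per-bin IQR rule can be sketched with pandas on synthetic data. This is an independent illustration of the technique, not a call into the library; the arrays and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: 20 x-values split into two equal-width bins.
x = np.arange(20, dtype=float)
y = np.concatenate([np.arange(10.0), np.arange(10.0)])
y[3] = 100.0  # plant one extreme value in the first bin

k = 3  # IQR multiplier, playing the role of `outlier_iqr`
df = pd.DataFrame({"bin_id": pd.cut(x, bins=2, labels=False), "y": y})

# Per-bin quartiles, broadcast back to every row of the bin.
grouped = df.groupby("bin_id")["y"]
q1 = grouped.transform(lambda v: v.quantile(0.25))
q3 = grouped.transform(lambda v: v.quantile(0.75))
iqr = q3 - q1

# A point is an outlier if it falls outside [Q1 - k*IQR, Q3 + k*IQR]
# computed within its own bin.
is_outlier = (df.y < q1 - k * iqr) | (df.y > q3 + k * iqr)
print(df[is_outlier])
```

Only the planted value at index 3 is flagged, because the thresholds are computed from the quartiles of its own bin rather than from the whole data set.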

Source code in distortions/geometry/neighborhoods.py

def boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs):
    """
    Compute boxplot statistics and identify outliers within distance bins.

    This function divides the x-values (typically true distances) into bins and
    computes boxplot statistics for the y-values (typically embedding distances)
    within each bin. It identifies outliers using the IQR method.

    Parameters
    ----------
    x : array-like
        Input values used for binning (typically true/original distances).
    y : array-like
        Target values for which to compute statistics (typically embedding distances).
    nbin : int, default=10
        Number of bins to divide the x-value range into.
    outlier_iqr : float, default=3
        IQR multiplier for outlier detection. Values beyond Q1 - outlier_iqr*IQR
        or Q3 + outlier_iqr*IQR within each bin are considered outliers.
    **kwargs : keyword arguments
        Additional keyword arguments (currently unused).

    Returns
    -------
    summaries : pd.DataFrame
        DataFrame with boxplot statistics for each bin containing columns:
        - 'bin_id': bin identifier
        - 'q1', 'q2', 'q3': quartile values
        - 'min', 'max': minimum and maximum values
        - 'iqr': interquartile range
        - 'lower', 'upper': outlier detection bounds
        - 'bin': string representation of bin range
    outliers : pd.DataFrame  
        DataFrame with outlier information containing columns:
        - 'index': original index of outlier point
        - 'bin_id': which bin the outlier belongs to
        - 'bin': string representation of bin range
        - 'value': the outlier y-value
    """
    # divide the data into nbin groups, and compute quantiles in each
    bin_ids, bin_edges = pd.cut(x, bins=nbin, labels=False, retbins=True)
    bin_edges = np.round(bin_edges, 1)

    summaries = (
        pd.DataFrame({'bin_id': bin_ids, 'y': y})
        .groupby('bin_id', as_index=False)['y']
        .agg(q1=lambda v: np.percentile(v, 25),
             q2=lambda v: np.percentile(v, 50),
             q3=lambda v: np.percentile(v, 75),
             min='min', max='max')
    )
    summaries['iqr'] = summaries.q3 - summaries.q1
    summaries['lower'] = np.maximum(summaries.q2 - outlier_iqr * summaries.iqr, summaries['min'])
    summaries['upper'] = np.minimum(summaries.q2 + outlier_iqr * summaries.iqr, summaries['max'])
    summaries['bin'] = summaries['bin_id'].map(lambda b: f"{bin_edges[b]}-{bin_edges[b + 1]}")

    # compute outliers according to the IQR above
    outliers = [
        {"index": i, "bin_id": int(b), "bin": f"{bin_edges[b]}-{bin_edges[b + 1]}", "value": val}
        for i, (b, val) in enumerate(zip(bin_ids, y))
        if not np.isnan(b) and (
            val < summaries.loc[b, 'q1'] - outlier_iqr * summaries.loc[b, 'iqr'] or
            val > summaries.loc[b, 'q3'] + outlier_iqr * summaries.loc[b, 'iqr']
        )
    ]
    return summaries, pd.DataFrame(outliers)

`local_distortions(embedding, data, geom)`

Compute local Riemannian metric distortions for each sample.

Parameters:

Name	Type	Description	Default
`embedding`	`(ndarray, shape(n_samples, n_embedding_dims))`	Low-dimensional embedding of the data. Each row corresponds to a sample, and each column corresponds to an embedding dimension.	required
`data`	`(ndarray, shape(n_samples, n_features))`	Original high-dimensional data. Each row is a sample, each column a feature.	required
`geom`	`Geometry`	An instance of the Geometry class (from geometry.py) that provides methods for setting the data matrix and computing the Laplacian matrix.	required

Returns:

Name	Type	Description
`H`	`ndarray`	Dual Riemannian metric tensor for each sample.
`Hvv`	`ndarray`	Singular vectors of the dual metric tensor for each sample.
`Hs`	`ndarray`	Singular values of the dual metric tensor for each sample.

Notes

This function sets the data matrix in the provided Geometry object, computes the Laplacian matrix, and then estimates the local Riemannian metric distortions in the embedding space using the original data.

Source code in distortions/geometry/rmetric.py

def local_distortions(embedding, data, geom):
    """
    Compute local Riemannian metric distortions for each sample.

    Parameters
    ----------
    embedding : np.ndarray, shape (n_samples, n_embedding_dims)
        Low-dimensional embedding of the data. Each row corresponds to a sample,
        and each column corresponds to an embedding dimension.
    data : np.ndarray, shape (n_samples, n_features)
        Original high-dimensional data. Each row is a sample, each column a feature.
    geom : Geometry
        An instance of the Geometry class (from geometry.py) that provides
        methods for setting the data matrix and computing the Laplacian matrix.

    Returns
    -------
    H : np.ndarray
        Dual Riemannian metric tensor for each sample.
    Hvv : np.ndarray
        Singular vectors of the dual metric tensor for each sample.
    Hs : np.ndarray
        Singular values of the dual metric tensor for each sample.

    Notes
    -----
    This function sets the data matrix in the provided Geometry object,
    computes the Laplacian matrix, and then estimates the local Riemannian
    metric distortions in the embedding space using the original data.
    """
    geom.set_data_matrix(data)
    L = geom.compute_laplacian_matrix()
    _, _, Hvv, Hs, _, H = riemann_metric(embedding, L, n_dim=2)
    return H, Hvv, Hs
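
As a rough illustration of how the three return values relate, the sketch below builds a stack of symmetric positive semi-definite matrices standing in for `H` and decomposes each one with NumPy's batched SVD. The array `H_demo` and the helper `decompose_metric` are hypothetical and not part of the library, and the shapes that `riemann_metric` actually returns are not confirmed here.

```python
import numpy as np

def decompose_metric(H):
    """Batched SVD of per-sample matrices with shape (n_samples, d, d)."""
    # np.linalg.svd operates on the last two axes, one matrix per sample.
    U, s, _ = np.linalg.svd(H)
    return U, s

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2, 2))
# A @ A^T is symmetric positive semi-definite, like a metric tensor (assumption).
H_demo = A @ np.transpose(A, (0, 2, 1))
Hvv_demo, Hs_demo = decompose_metric(H_demo)
```

In this toy setting, a large ratio between the two singular values of a sample would indicate that the embedding stretches one direction much more than the other around that sample.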

`neighborhood_distances(adata, embed_key='X_umap')`

Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

This function calculates pairwise distances between each sample and its neighbors in the original high-dimensional space and compares them with distances in the reduced embedding space. This is useful for analyzing how well the embedding preserves local neighborhood structure.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in `obsm[embed_key]` and a neighbor graph in `obsp["distances"]`.	required
`embed_key`	`str`	Key in `adata.obsm` where the embedding coordinates are stored.	`"X_umap"`

Returns:

Type	Description
`DataFrame`	DataFrame with columns: - 'center': index of the sample (cell) - 'neighbor': index of the neighbor sample - 'true': distance in the original space (from `adata.obsp["distances"]`) - 'embedding': distance in the embedding space (from `adata.obsm[embed_key]`)

Notes

The number of neighbors is determined by the structure of the neighbor graph in adata.obsp["distances"]. The function assumes that the embedding and neighbor graph have already been computed.

Source code in distortions/geometry/neighborhoods.py

def neighborhood_distances(adata, embed_key="X_umap"):
    """
    Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

    This function calculates pairwise distances between each sample and its
    neighbors in the original high-dimensional space and compares them with
    distances in the reduced embedding space. This is useful for analyzing
    how well the embedding preserves local neighborhood structure.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in `obsm[embed_key]`
        and a neighbor graph in `obsp["distances"]`.
    embed_key : str, default="X_umap"
        Key in `adata.obsm` where the embedding coordinates are stored.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
            - 'center': index of the sample (cell)
            - 'neighbor': index of the neighbor sample
            - 'true': distance in the original space (from `adata.obsp["distances"]`)
            - 'embedding': distance in the embedding space (from `adata.obsm[embed_key]`)

    Notes
    -----
    The number of neighbors is determined by the structure of the neighbor graph in `adata.obsp["distances"]`.
    The function assumes that the embedding and neighbor graph have already been computed.
    """
    knn_graph = adata.obsp["distances"]
    dist_list = []

    for ix in range(len(adata)):
        neighbors = knn_graph[ix].nonzero()[1]
        true = knn_graph[ix, neighbors].toarray().flatten()
        embedding = cdist(
            [adata.obsm[embed_key][ix, :]], 
            adata.obsm[embed_key][neighbors, :]
        ).flatten()
        dist_list.append(pd.DataFrame({
            "center": [ix] * len(neighbors), 
            "neighbor": neighbors,
            "true": true,
            "embedding": embedding
        }))

    return pd.concat(dist_list)

`neighborhoods`

`boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs)`

Compute boxplot statistics and identify outliers within distance bins.

This function divides the x-values (typically true distances) into bins and computes boxplot statistics for the y-values (typically embedding distances) within each bin. It identifies outliers using the IQR method.

Parameters:

Name	Type	Description	Default
`x`	`array-like`	Input values used for binning (typically true/original distances).	required
`y`	`array-like`	Target values for which to compute statistics (typically embedding distances).	required
`nbin`	`int`	Number of bins to divide the x-value range into.	`10`
`outlier_iqr`	`float`	IQR multiplier for outlier detection. Values beyond `Q1 - outlier_iqr*IQR` or `Q3 + outlier_iqr*IQR` within each bin are considered outliers.	`3`
`**kwargs`	`keyword arguments`	Additional keyword arguments (currently unused).	`{}`

Returns:

Name	Type	Description
`summaries`	`DataFrame`	DataFrame with boxplot statistics for each bin containing columns: - 'bin_id': bin identifier - 'q1', 'q2', 'q3': quartile values - 'min', 'max': minimum and maximum values - 'iqr': interquartile range - 'lower', 'upper': whisker bounds (q2 minus/plus outlier_iqr times iqr, clipped to the bin's 'min' and 'max') - 'bin': string representation of bin range
`outliers`	`DataFrame`	DataFrame with outlier information containing columns: - 'index': original index of outlier point - 'bin_id': which bin the outlier belongs to - 'bin': string representation of bin range - 'value': the outlier y-value

Source code in distortions/geometry/neighborhoods.py

def boxplot_data(x, y, nbin=10, outlier_iqr=3, **kwargs):
    """
    Compute boxplot statistics and identify outliers within distance bins.

    This function divides the x-values (typically true distances) into bins and
    computes boxplot statistics for the y-values (typically embedding distances)
    within each bin. It identifies outliers using the IQR method.

    Parameters
    ----------
    x : array-like
        Input values used for binning (typically true/original distances).
    y : array-like
        Target values for which to compute statistics (typically embedding distances).
    nbin : int, default=10
        Number of bins to divide the x-value range into.
    outlier_iqr : float, default=3
        IQR multiplier for outlier detection. Values beyond Q1 - outlier_iqr*IQR
        or Q3 + outlier_iqr*IQR within each bin are considered outliers.
    **kwargs : keyword arguments
        Additional keyword arguments (currently unused).

    Returns
    -------
    summaries : pd.DataFrame
        DataFrame with boxplot statistics for each bin containing columns:
        - 'bin_id': bin identifier
        - 'q1', 'q2', 'q3': quartile values
        - 'min', 'max': minimum and maximum values
        - 'iqr': interquartile range
        - 'lower', 'upper': whisker bounds (q2 -/+ outlier_iqr*iqr, clipped to min/max)
        - 'bin': string representation of bin range
    outliers : pd.DataFrame  
        DataFrame with outlier information containing columns:
        - 'index': original index of outlier point
        - 'bin_id': which bin the outlier belongs to
        - 'bin': string representation of bin range
        - 'value': the outlier y-value
    """
    # divide the data into nbin groups, and compute quantiles in each
    bin_ids, bin_edges = pd.cut(x, bins=nbin, labels=False, retbins=True)
    bin_edges = np.round(bin_edges, 1)

    summaries = (
        pd.DataFrame({'bin_id': bin_ids, 'y': y})
        .groupby('bin_id', as_index=False)['y']
        .agg(q1=lambda v: np.percentile(v, 25),
             q2=lambda v: np.percentile(v, 50),
             q3=lambda v: np.percentile(v, 75),
             min='min', max='max')
    )
    summaries['iqr'] = summaries.q3 - summaries.q1
    summaries['lower'] = np.maximum(summaries.q2 - outlier_iqr * summaries.iqr, summaries['min'])
    summaries['upper'] = np.minimum(summaries.q2 + outlier_iqr * summaries.iqr, summaries['max'])
    summaries['bin'] = summaries['bin_id'].map(lambda b: f"{bin_edges[b]}-{bin_edges[b + 1]}")

    # compute outliers according to the IQR above
    outliers = [
        {"index": i, "bin_id": int(b), "bin": f"{bin_edges[b]}-{bin_edges[b + 1]}", "value": val}
        for i, (b, val) in enumerate(zip(bin_ids, y))
        if not np.isnan(b) and (
            val < summaries.loc[b, 'q1'] - outlier_iqr * summaries.loc[b, 'iqr'] or
            val > summaries.loc[b, 'q3'] + outlier_iqr * summaries.loc[b, 'iqr']
        )
    ]
    return summaries, pd.DataFrame(outliers)
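
The binning and IQR rule described above can be tried on toy data. The sketch below is a standalone approximation using only NumPy and pandas, not a call into the library; `flag_bin_outliers` is a hypothetical helper that returns a boolean mask instead of the two DataFrames.

```python
import numpy as np
import pandas as pd

def flag_bin_outliers(x, y, nbin=10, outlier_iqr=3):
    """Return a boolean mask marking y-values outside the per-bin IQR fences."""
    df = pd.DataFrame({"x": np.asarray(x, dtype=float),
                       "y": np.asarray(y, dtype=float)})
    # Equal-width bins over the range of x, labelled 0..nbin-1.
    df["bin_id"] = pd.cut(df["x"], bins=nbin, labels=False)
    grouped = df.groupby("bin_id")["y"]
    # Broadcast each bin's quartiles back onto its rows.
    q1 = grouped.transform(lambda v: np.percentile(v, 25))
    q3 = grouped.transform(lambda v: np.percentile(v, 75))
    spread = q3 - q1
    low = q1 - outlier_iqr * spread
    high = q3 + outlier_iqr * spread
    return ((df["y"] < low) | (df["y"] > high)).to_numpy()

x = np.linspace(0.0, 1.0, 200)
y = x.copy()
y[100] = 25.0  # one deliberately distorted value
mask = flag_bin_outliers(x, y, nbin=4)
```

Because the fences are computed per bin, a value is judged only against other points with similar `x`, which is what lets a single stretched distance stand out.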

`broken_knn(embedding, k=2, z_thresh=1.0)`

Determine broken points in embedding space using k-NN distances and Z-score thresholding.

This function identifies potentially problematic points in an embedding by computing their average k-nearest neighbor distances, calculating Z-scores, and flagging points that exceed the threshold as broken or isolated.

Parameters:

Name	Type	Description	Default
`embedding`	`(array-like, shape(n_samples, n_features))`	The embedding coordinates for all samples.	required
`k`	`int`	Number of nearest neighbors to consider for distance calculation.	`2`
`z_thresh`	`float`	Z-score threshold for identifying broken points. Points with Z-scores greater than or equal to this value are considered broken.	`1.0`

Returns:

Type	Description
`list of int`	List of indices of broken points, sorted by descending Z-score. If no points exceed the threshold, returns the single point with the highest Z-score.

Source code in distortions/geometry/neighborhoods.py

def broken_knn(embedding, k=2, z_thresh=1.0):
    """
    Determine broken points in embedding space using k-NN distances and Z-score thresholding.

    This function identifies potentially problematic points in an embedding by
    computing their average k-nearest neighbor distances, calculating Z-scores,
    and flagging points that exceed the threshold as broken or isolated.

    Parameters
    ----------
    embedding : array-like, shape (n_samples, n_features)
        The embedding coordinates for all samples.
    k : int, default=2
        Number of nearest neighbors to consider for distance calculation.
    z_thresh : float, default=1.0
        Z-score threshold for identifying broken points. Points with Z-scores
        greater than or equal to this value are considered broken.

    Returns
    -------
    list of int
        List of indices of broken points, sorted by descending Z-score.
        If no points exceed the threshold, returns the single point with
        the highest Z-score.
    """
    # 1) mean distance to the k nearest points (each point is its own first neighbor)
    sub = embedding
    nbr_sub = NearestNeighbors(n_neighbors=k).fit(sub)
    d_sub, _ = nbr_sub.kneighbors(sub)
    d1 = d_sub.mean(axis=1)

    # 2) Z-score & threshold
    mu, sigma = d1.mean(), d1.std()
    z = (d1 - mu) / sigma
    locs = np.where(z >= z_thresh)[0]
    if len(locs)==0:
        locs = [int(np.argmax(z))]

    # 3) rank by descending Z-score
    locs = sorted(locs, key=lambda i: z[i], reverse=True)
    return locs
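
The Z-score step can be illustrated without scikit-learn. The sketch below swaps `NearestNeighbors` for brute-force NumPy distances, so it is only suitable for small inputs; `knn_zscores` is a hypothetical helper. As in the listing, each point counts as its own nearest neighbor, so with `k=2` the mean covers a zero self-distance and one real neighbor.

```python
import numpy as np

def knn_zscores(embedding, k=2):
    """Z-score of each point's mean distance to its k closest points (self included)."""
    X = np.asarray(embedding, dtype=float)
    # Full pairwise Euclidean distance matrix, O(n^2) memory.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 after sorting is the zero distance from each point to itself.
    nearest = np.sort(dists, axis=1)[:, :k]
    d_mean = nearest.mean(axis=1)
    return (d_mean - d_mean.mean()) / d_mean.std()

rng = np.random.default_rng(0)
cluster = rng.normal(scale=0.1, size=(50, 2))
points = np.vstack([cluster, [[10.0, 10.0]]])  # last point is far from the rest
z = knn_zscores(points)
flagged = np.where(z >= 1.0)[0]
```

With one isolated point and a tight cluster, only the isolated point reaches the default threshold of 1.0.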

`identify_broken_box(dists, outlier_factor=3, nbin=10)`

Identify broken links using boxplot-based outlier detection within distance bins.

This helper function bins the true distances and identifies outliers in the embedding distances within each bin using boxplot criteria.

Parameters:

Name	Type	Description	Default
`dists`	`DataFrame`	DataFrame with 'true' and 'embedding' distance columns.	required
`outlier_factor`	`float`	IQR multiplier for outlier detection threshold.	`3`
`nbin`	`int`	Number of bins to divide the true distance range into.	`10`

Returns:

Type	Description
`DataFrame`	Copy of input distances DataFrame with additional 'brokenness' boolean column indicating which links are identified as broken outliers.

Source code in distortions/geometry/neighborhoods.py

def identify_broken_box(dists, outlier_factor=3, nbin=10):
    """
    Identify broken links using boxplot-based outlier detection within distance bins.

    This helper function bins the true distances and identifies outliers in the
    embedding distances within each bin using boxplot criteria.

    Parameters
    ----------
    dists : pd.DataFrame
        DataFrame with 'true' and 'embedding' distance columns.
    outlier_factor : float, default=3
        IQR multiplier for outlier detection threshold.
    nbin : int, default=10
        Number of bins to divide the true distance range into.

    Returns
    -------
    pd.DataFrame
        Copy of input distances DataFrame with additional 'brokenness' boolean column
        indicating which links are identified as broken outliers.
    """
    _, outliers = boxplot_data(dists["true"], dists["embedding"], nbin, outlier_factor)
    brokenness = dists.copy()
    brokenness = brokenness.reset_index()
    brokenness["brokenness"] = False
    brokenness.loc[outliers["index"].values, "brokenness"] = True
    return brokenness

`identify_broken_window(dists, outlier_factor=3, percentiles=[75, 25], frame=[50, 50])`

Identify broken links using sliding window smoothing and residual analysis.

This helper function applies a sliding window median filter to the distance relationship and identifies links where the embedding distance significantly exceeds the smoothed expectation.

Parameters:

Name	Type	Description	Default
`dists`	`DataFrame`	DataFrame with 'true' and 'embedding' distance columns.	required
`outlier_factor`	`float`	Multiplier for IQR-based outlier threshold in residual analysis.	`3`
`percentiles`	`list of float`	Percentiles used for IQR calculation.	`[75, 25]`
`frame`	`list of int`	Window frame size [before, after] for sliding median calculation.	`[50, 50]`

Returns:

Type	Description
`DataFrame`	DataFrame with original columns plus: - 'embedding_smooth': smoothed embedding distances - 'residual': difference between actual and smoothed embedding distances - 'brokenness': boolean indicating broken links

Source code in distortions/geometry/neighborhoods.py

def identify_broken_window(dists, outlier_factor=3, percentiles=[75, 25], frame=[50, 50]):
    """
    Identify broken links using sliding window smoothing and residual analysis.

    This helper function applies a sliding window median filter to the distance
    relationship and identifies links where the embedding distance significantly
    exceeds the smoothed expectation.

    Parameters
    ----------
    dists : pd.DataFrame
        DataFrame with 'true' and 'embedding' distance columns.
    outlier_factor : float, default=3
        Multiplier for IQR-based outlier threshold in residual analysis.
    percentiles : list of float, default=[75, 25]
        Percentiles used for IQR calculation.
    frame : list of int, default=[50, 50]
        Window frame size [before, after] for sliding median calculation.

    Returns
    -------
    pd.DataFrame
        DataFrame with original columns plus:
        - 'embedding_smooth': smoothed embedding distances
        - 'residual': difference between actual and smoothed embedding distances  
        - 'brokenness': boolean indicating broken links
    """
    line = alt.Chart(dists).transform_window(
        embedding_smooth='median(embedding)',
        sort=[alt.SortField('true')],
        frame=frame
    ).mark_line().encode(
        x='true:Q',
        y='embedding_smooth:Q'
    )

    result = extract_data(line).drop_duplicates()
    result["residual"] = result["embedding"] - result["embedding_smooth"]
    result["brokenness"] = result["residual"] > result["embedding_smooth"] + \
        outlier_factor * iqr(result["residual"], percentiles)
    return result
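
For readers who want to experiment with the smoothing idea outside Altair, the sketch below substitutes a centered pandas rolling median for `transform_window`. The window length of 101 rows is an assumption meant to resemble 50 rows on each side, and `flag_by_rolling_median` is a hypothetical helper. The final comparison mirrors the one in the listing.

```python
import numpy as np
import pandas as pd

def flag_by_rolling_median(dists, outlier_factor=3, window=101):
    """Flag links whose embedding distance sits far above a rolling median."""
    out = dists.sort_values("true").reset_index(drop=True)
    # Centered median over neighboring rows in true-distance order.
    out["embedding_smooth"] = (
        out["embedding"].rolling(window, center=True, min_periods=1).median()
    )
    out["residual"] = out["embedding"] - out["embedding_smooth"]
    q75, q25 = np.percentile(out["residual"], [75, 25])
    # Same form of threshold as the listing: smooth value plus a multiple of the IQR.
    cutoff = out["embedding_smooth"] + outlier_factor * (q75 - q25)
    out["brokenness"] = out["residual"] > cutoff
    return out

rng = np.random.default_rng(0)
true = np.linspace(1.0, 2.0, 300)
embedding = true + rng.normal(scale=0.01, size=300)
embedding[150] = 20.0  # one stretched link
result = flag_by_rolling_median(pd.DataFrame({"true": true, "embedding": embedding}))
```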

`iqr(x, percentiles)`

Calculate the interquartile range between given percentiles.

This function computes the difference between two percentiles of the input array, typically used to measure the spread of data.

Parameters:

Name	Type	Description	Default
`x`	`array-like`	Input array for which to calculate the interquartile range.	required
`percentiles`	`array-like of length 2`	Two percentile values, higher first (e.g., [75, 25] for the standard IQR). The function returns the first percentile minus the second.	required

Returns:

Type	Description
`float`	The interquartile range (difference between the specified percentiles).

Source code in distortions/geometry/neighborhoods.py

def iqr(x, percentiles):
    """
    Calculate the interquartile range between given percentiles.

    This function computes the difference between two percentiles of the
    input array, typically used to measure the spread of data.

    Parameters
    ----------
    x : array-like
        Input array for which to calculate the interquartile range.
    percentiles : array-like of length 2
        Two percentile values, higher first (e.g., [75, 25] for the standard IQR).
        The function returns the first percentile minus the second.

    Returns
    -------
    float
        The interquartile range (difference between the specified percentiles).
    """
    return np.subtract(*np.percentile(x, percentiles))
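
Because the function subtracts the second percentile from the first, the order of `percentiles` determines the sign of the result. The short check below repeats the one-liner from the listing on a toy array to make that visible.

```python
import numpy as np

def iqr(x, percentiles):
    # Same one-liner as the listing above: first percentile minus the second.
    return np.subtract(*np.percentile(x, percentiles))

x = np.arange(101)            # 0, 1, ..., 100
spread = iqr(x, [75, 25])     # 50.0
flipped = iqr(x, [25, 75])    # -50.0
```

The defaults elsewhere in this module pass `[75, 25]`, which yields the usual non-negative spread.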

`neighbor_generator(embedding, broken_locations=[], number_neighbor=10)`

Generate neighbor lists for broken points in the embedding space.

This function finds nearest neighbors for specified broken points (or automatically detected ones) in the embedding space. It's useful for understanding the local neighborhood structure around problematic points.

Parameters:

Name	Type	Description	Default
`embedding`	`(array-like, shape(n_samples, n_features))`	The embedding coordinates for all samples.	required
`broken_locations`	`list of int`	Indices of broken points for which to generate neighbors. If empty, automatically detects broken points using broken_knn().	`[]`
`number_neighbor`	`int`	Number of nearest neighbors to find for each broken point.	`10`

Returns:

Type	Description
`dict`	Dictionary mapping broken point indices (int) to lists of their nearest neighbor indices, excluding the point itself.

Source code in distortions/geometry/neighborhoods.py

def neighbor_generator(embedding, broken_locations = [], number_neighbor=10):
    """
    Generate neighbor lists for broken points in the embedding space.

    This function finds nearest neighbors for specified broken points (or
    automatically detected ones) in the embedding space. It's useful for
    understanding the local neighborhood structure around problematic points.

    Parameters
    ----------
    embedding : array-like, shape (n_samples, n_features)
        The embedding coordinates for all samples.
    broken_locations : list of int, default=[]
        Indices of broken points for which to generate neighbors. If empty,
        automatically detects broken points using broken_knn().
    number_neighbor : int, default=10
        Number of nearest neighbors to find for each broken point.

    Returns
    -------
    dict
        Dictionary mapping broken point indices (int) to lists of their 
        nearest neighbor indices, excluding the point itself.
    """
    if len(broken_locations) == 0:
        broken_locations = broken_knn(embedding)
    nbr_full = NearestNeighbors(n_neighbors=number_neighbor+1).fit(embedding)
    isolated = {}
    for idx in broken_locations:
        _, neigh = nbr_full.kneighbors([embedding[idx]])
        isolated[int(idx)] = neigh[0][1:].tolist()  # drop self
    return isolated
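
The shape of the returned dictionary is easiest to see on a handful of points. The sketch below uses brute-force NumPy distances rather than `NearestNeighbors`, and `nearest_excluding_self` is a hypothetical helper. It filters the query index explicitly, whereas the listing drops the first result on the assumption that it is the query point itself.

```python
import numpy as np

def nearest_excluding_self(embedding, idx, number_neighbor=10):
    """Indices of the points closest to embedding[idx], excluding idx itself."""
    X = np.asarray(embedding, dtype=float)
    d = np.linalg.norm(X - X[idx], axis=1)
    order = np.argsort(d, kind="stable")
    return [int(j) for j in order if j != idx][:number_neighbor]

points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [6.0, 0.0], [10.0, 0.0]])
# Treat points 0 and 4 as the "broken" locations for this toy example.
isolated = {i: nearest_excluding_self(points, i, number_neighbor=2) for i in [0, 4]}
# isolated == {0: [1, 2], 4: [3, 2]}
```

Filtering by index may behave differently from dropping the first result when the embedding contains exact duplicate points, since a duplicate can tie with the query point at distance zero.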

`neighborhood_distances(adata, embed_key='X_umap')`

Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

This function calculates pairwise distances between each sample and its neighbors in the original high-dimensional space and compares them with distances in the reduced embedding space. This is useful for analyzing how well the embedding preserves local neighborhood structure.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in `obsm[embed_key]` and a neighbor graph in `obsp["distances"]`.	required
`embed_key`	`str`	Key in `adata.obsm` where the embedding coordinates are stored.	`"X_umap"`

Returns:

Type	Description
`DataFrame`	DataFrame with columns: - 'center': index of the sample (cell) - 'neighbor': index of the neighbor sample - 'true': distance in the original space (from `adata.obsp["distances"]`) - 'embedding': distance in the embedding space (from `adata.obsm[embed_key]`)

Notes

The number of neighbors is determined by the structure of the neighbor graph in adata.obsp["distances"]. The function assumes that the embedding and neighbor graph have already been computed.

Source code in distortions/geometry/neighborhoods.py

def neighborhood_distances(adata, embed_key="X_umap"):
    """
    Compute pairwise distances between samples and their neighbors in both original and embedding spaces.

    This function calculates pairwise distances between each sample and its
    neighbors in the original high-dimensional space and compares them with
    distances in the reduced embedding space. This is useful for analyzing
    how well the embedding preserves local neighborhood structure.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix. Must contain a precomputed embedding (e.g., UMAP or t-SNE) in `obsm[embed_key]`
        and a neighbor graph in `obsp["distances"]`.
    embed_key : str, default="X_umap"
        Key in `adata.obsm` where the embedding coordinates are stored.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
            - 'center': index of the sample (cell)
            - 'neighbor': index of the neighbor sample
            - 'true': distance in the original space (from `adata.obsp["distances"]`)
            - 'embedding': distance in the embedding space (from `adata.obsm[embed_key]`)

    Notes
    -----
    The number of neighbors is determined by the structure of the neighbor graph in `adata.obsp["distances"]`.
    The function assumes that the embedding and neighbor graph have already been computed.
    """
    knn_graph = adata.obsp["distances"]
    dist_list = []

    for ix in range(len(adata)):
        neighbors = knn_graph[ix].nonzero()[1]
        true = knn_graph[ix, neighbors].toarray().flatten()
        embedding = cdist(
            [adata.obsm[embed_key][ix, :]], 
            adata.obsm[embed_key][neighbors, :]
        ).flatten()
        dist_list.append(pd.DataFrame({
            "center": [ix] * len(neighbors), 
            "neighbor": neighbors,
            "true": true,
            "embedding": embedding
        }))

    return pd.concat(dist_list)
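
The structure of the output table can be reproduced without an `AnnData` object. In the sketch below a small dense array stands in for the sparse `obsp["distances"]` matrix (zero meaning "no edge") and a coordinate array stands in for `obsm[embed_key]`. `neighbor_distance_table` is a hypothetical, vectorized stand-in for the row-by-row loop in the listing.

```python
import numpy as np
import pandas as pd

def neighbor_distance_table(graph, coords):
    """Long-format table of graph distances and embedding distances per edge."""
    # Every nonzero entry (i, j) is a link from center i to neighbor j.
    centers, neighbors = np.nonzero(graph)
    embedding = np.linalg.norm(coords[centers] - coords[neighbors], axis=1)
    return pd.DataFrame({
        "center": centers,
        "neighbor": neighbors,
        "true": graph[centers, neighbors],
        "embedding": embedding,
    })

graph = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 2.0],
                  [0.0, 2.0, 0.0]])
coords = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]])
table = neighbor_distance_table(graph, coords)
```

Each row pairs one stored neighbor distance with the Euclidean distance between the same two samples in the embedding, which is the input the outlier detectors in this module expect.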

`neighborhoods(adata, outlier_factor=3, threshold=0.2, method='box', percentiles=[75, 25], frame=[50, 50], nbin=10, **kwargs)`

Identify broken neighborhoods in embeddings using different methods.

This function serves as the main interface for detecting broken neighborhoods in dimensionality reduction embeddings. It supports multiple methods for identifying outliers and broken links between original and embedding spaces.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix with precomputed embedding and neighbor graph.	required
`outlier_factor`	`float`	Factor used to determine outlier threshold. Higher values are more permissive (fewer outliers detected).	`3`
`threshold`	`float`	Proportion threshold for flagging samples as having broken neighborhoods. Centers with more than this proportion of broken neighbors are flagged.	`0.2`
`method`	`str`	Method for identifying broken neighborhoods. Options: - "box": Uses boxplot-based outlier detection - "window": Uses sliding window smoothing with residual analysis	`"box"`
`percentiles`	`list of float`	Percentiles used for IQR calculation in windowing method.	`[75, 25]`
`frame`	`list of int`	Window frame size [before, after] for sliding window smoothing.	`[50, 50]`
`nbin`	`int`	Number of bins for boxplot method.	`10`
`**kwargs`	`keyword arguments`	Additional arguments passed to neighborhood_distances().	`{}`

Returns:

Type	Description
`dict`	Dictionary mapping center indices to lists of their neighbor indices for samples with broken neighborhoods.

Raises:

Type	Description
`NotImplementedError`	If an unsupported method is specified.

Source code in distortions/geometry/neighborhoods.py

def neighborhoods(adata, outlier_factor=3, threshold=0.2, method="box",
                  percentiles=[75, 25], frame=[50, 50], nbin=10, **kwargs):
    """
    Identify broken neighborhoods in embeddings using different methods.

    This function serves as the main interface for detecting broken neighborhoods
    in dimensionality reduction embeddings. It supports multiple methods for
    identifying outliers and broken links between original and embedding spaces.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix with precomputed embedding and neighbor graph.
    outlier_factor : float, default=3
        Factor used to determine outlier threshold. Higher values are more
        permissive (fewer outliers detected).
    threshold : float, default=0.2  
        Proportion threshold for flagging samples as having broken neighborhoods.
        Centers with more than this proportion of broken neighbors are flagged.
    method : str, default="box"
        Method for identifying broken neighborhoods. Options:
        - "box": Uses boxplot-based outlier detection
        - "window": Uses sliding window smoothing with residual analysis
    percentiles : list of float, default=[75, 25]
        Percentiles used for IQR calculation in windowing method.
    frame : list of int, default=[50, 50]
        Window frame size [before, after] for sliding window smoothing.
    nbin : int, default=10
        Number of bins for boxplot method.
    **kwargs : keyword arguments
        Additional arguments passed to neighborhood_distances().

    Returns
    -------
    dict
        Dictionary mapping center indices to lists of their neighbor indices
        for samples with broken neighborhoods.

    Raises
    ------
    NotImplementedError
        If an unsupported method is specified.
    """
    if method == "box":
        return neighborhoods_box(adata, outlier_factor, threshold, nbin, **kwargs)
    elif method == "window":
        return neighborhoods_window(adata, outlier_factor, threshold, percentiles, frame, **kwargs)
    else:
        raise NotImplementedError(f"Method {method} not implemented for broken neighborhood construction.")
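
The `threshold` parameter is described as a proportion of broken neighbors per center. One plausible reading of that rule is sketched below on a toy link table. `centers_over_threshold` is a hypothetical helper, and returning every neighbor of a flagged center (rather than only the broken ones) is an assumption, since the aggregation code itself is not shown here.

```python
import pandas as pd

def centers_over_threshold(links, threshold=0.2):
    """Map each center whose broken-link share exceeds threshold to its neighbors."""
    # The mean of a boolean column is the fraction of True values.
    share = links.groupby("center")["brokenness"].mean()
    flagged = share[share > threshold].index
    return {
        int(c): links.loc[links["center"] == c, "neighbor"].tolist()
        for c in flagged
    }

links = pd.DataFrame({
    "center":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "neighbor":   [1, 2, 3, 4, 5, 0, 2, 3, 4, 5],
    "brokenness": [True, True, False, False, False,
                   False, False, False, False, True],
})
broken = centers_over_threshold(links)
# Center 0 has 2 of 5 links broken (0.4); center 1 has 1 of 5 (0.2, not above 0.2).
```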

`neighborhoods_box(adata, outlier_factor=3, threshold=0.2, nbin=10, **kwargs)`

Identify broken neighborhoods using boxplot-based outlier detection.

This method bins the true distances and computes boxplot statistics within each bin. Links are considered broken if their embedding distance is an outlier relative to other links with similar true distances.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix with precomputed embedding and neighbor graph.	required
`outlier_factor`	`float`	IQR multiplier for boxplot outlier detection. Values below Q1 - outlier_factor * IQR or above Q3 + outlier_factor * IQR are outliers.	`3`
`threshold`	`float`	Proportion threshold for flagging samples as having broken neighborhoods.	`0.2`
`nbin`	`int`	Number of bins to divide the true distance range into.	`10`
`**kwargs`	`keyword arguments`	Additional arguments passed to neighborhood_distances().	`{}`

Returns:

Type	Description
`dict`	Dictionary mapping center indices to lists of their neighbor indices for samples with broken neighborhoods.
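
The binning and boxplot rule described above can be sketched in plain Python. This is an illustration only, not the library's `identify_broken_box`: the function name `flag_box_outliers`, the equal-width binning, and the quartile method are all assumptions made for the example.

```python
from statistics import quantiles


def flag_box_outliers(true_d, emb_d, outlier_factor=3.0, nbin=10):
    """Flag links whose embedding distance is a boxplot outlier within its bin.

    Hypothetical helper: true_d and emb_d are equal-length lists of true and
    embedding distances, one entry per link.
    """
    lo, hi = min(true_d), max(true_d)
    # Fall back to a unit width when all true distances are identical.
    width = (hi - lo) / nbin or 1.0
    # Assign each link to an equal-width bin over the true distance range.
    bins = [min(int((t - lo) / width), nbin - 1) for t in true_d]

    flags = [False] * len(true_d)
    for b in set(bins):
        idx = [i for i, bi in enumerate(bins) if bi == b]
        if len(idx) < 2:
            continue  # quartiles need at least two points
        q1, _, q3 = quantiles([emb_d[i] for i in idx], n=4, method="inclusive")
        iqr = q3 - q1
        for i in idx:
            # Two-sided rule: beyond Q1 - k * IQR or Q3 + k * IQR.
            flags[i] = (
                emb_d[i] < q1 - outlier_factor * iqr
                or emb_d[i] > q3 + outlier_factor * iqr
            )
    return flags
```

Raising `outlier_factor` widens the fences, so fewer links are flagged within each bin.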

Source code in distortions/geometry/neighborhoods.py

def neighborhoods_box(adata, outlier_factor=3, threshold=0.2, nbin=10, **kwargs):
    """
    Identify broken neighborhoods using boxplot-based outlier detection.

    This method bins the true distances and computes boxplot statistics within
    each bin. Links are considered broken if their embedding distance is an
    outlier relative to other links with similar true distances.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix with precomputed embedding and neighbor graph.
    outlier_factor : float, default=3
        IQR multiplier for boxplot outlier detection. Values beyond
        Q1 - outlier_factor*IQR or Q3 + outlier_factor*IQR are outliers.
    threshold : float, default=0.2
        Proportion threshold for flagging samples as having broken neighborhoods.
    nbin : int, default=10
        Number of bins to divide the true distance range into.
    **kwargs : keyword arguments
        Additional arguments passed to neighborhood_distances().

    Returns
    -------
    dict
        Dictionary mapping center indices to lists of their neighbor indices
        for samples with broken neighborhoods.
    """
    dists = neighborhood_distances(adata, **kwargs)
    brokenness = identify_broken_box(dists, outlier_factor, nbin)
    return threshold_links(dists, brokenness, threshold)

`neighborhoods_window(adata, outlier_factor=3, threshold=0.2, percentiles=[75, 25], frame=[50, 50], **kwargs)`

Identify broken neighborhoods using window-based smoothing and residual analysis.

This method applies a sliding window median filter to the distance relationships and identifies outliers based on residuals from the smoothed curve. Points with large positive residuals indicate broken neighborhoods where embedding distances are much larger than expected.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	Annotated data matrix with precomputed embedding and neighbor graph.	required
`outlier_factor`	`float`	Multiplier for IQR-based outlier threshold. Residuals greater than median + outlier_factor * IQR are considered broken.	`3`
`threshold`	`float`	Proportion threshold for flagging samples as having broken neighborhoods.	`0.2`
`percentiles`	`list of float`	Percentiles used for IQR calculation in residual analysis.	`[75, 25]`
`frame`	`list of int`	Window frame size [before, after] for sliding median calculation.	`[50, 50]`
`**kwargs`	`keyword arguments`	Additional arguments passed to neighborhood_distances().	`{}`

Returns:

Type	Description
`dict`	Dictionary mapping center indices to lists of their neighbor indices for samples with broken neighborhoods.
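
As a rough illustration of the smoothing and residual rule, the sketch below sorts links by true distance, subtracts a sliding median of the embedding distances, and flags residuals above median + outlier_factor * IQR. It is not the library's `identify_broken_window`: the name `flag_window_outliers` is hypothetical, and the quartiles are fixed at 25 and 75 rather than taken from `percentiles`.

```python
from statistics import median, quantiles


def flag_window_outliers(true_d, emb_d, outlier_factor=3.0, frame=(50, 50)):
    """Flag links with large positive residuals from a sliding-median curve."""
    # Sort links by true distance so the window slides along that axis.
    order = sorted(range(len(true_d)), key=lambda i: true_d[i])
    emb_sorted = [emb_d[i] for i in order]

    before, after = frame
    residuals = []
    for pos, value in enumerate(emb_sorted):
        # The window is truncated at both ends of the sorted sequence.
        window = emb_sorted[max(0, pos - before): pos + after + 1]
        residuals.append(value - median(window))

    q1, med, q3 = quantiles(residuals, n=4, method="inclusive")
    cutoff = med + outlier_factor * (q3 - q1)

    # Map the flags back to the original link order.
    flags = [False] * len(true_d)
    for pos, i in enumerate(order):
        flags[i] = residuals[pos] > cutoff
    return flags
```

Only positive residuals are flagged, which matches the description above: a link is broken when its embedding distance is much larger than expected for its true distance.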

Source code in distortions/geometry/neighborhoods.py

def neighborhoods_window(adata, outlier_factor=3, threshold=0.2, percentiles=[75, 25], frame=[50, 50], **kwargs):
    """
    Identify broken neighborhoods using window-based smoothing and residual analysis.

    This method applies a sliding window median filter to the distance relationships
    and identifies outliers based on residuals from the smoothed curve. Points with
    large positive residuals indicate broken neighborhoods where embedding distances
    are much larger than expected.

    Parameters
    ----------
    adata : anndata.AnnData
        Annotated data matrix with precomputed embedding and neighbor graph.
    outlier_factor : float, default=3
        Multiplier for IQR-based outlier threshold. Residuals greater than
        median + outlier_factor * IQR are considered broken.
    threshold : float, default=0.2
        Proportion threshold for flagging samples as having broken neighborhoods.
    percentiles : list of float, default=[75, 25]
        Percentiles used for IQR calculation in residual analysis.
    frame : list of int, default=[50, 50]
        Window frame size [before, after] for sliding median calculation.
    **kwargs : keyword arguments
        Additional arguments passed to neighborhood_distances().

    Returns
    -------
    dict
        Dictionary mapping center indices to lists of their neighbor indices
        for samples with broken neighborhoods.
    """
    dists = neighborhood_distances(adata, **kwargs)
    brokenness = identify_broken_window(dists, outlier_factor, percentiles, frame)
    return threshold_links(dists, brokenness, threshold)

`threshold_links(dists, brokenness, threshold=0.2)`

Flag samples with high proportions of broken neighborhood links.

This function identifies samples where the proportion of broken neighborhood links exceeds a specified threshold, indicating problematic embedding regions.

Parameters:

Name	Type	Description	Default
`dists`	`DataFrame`	DataFrame containing distance information with 'center' and 'neighbor' columns.	required
`brokenness`	`DataFrame`	DataFrame with 'center' and 'brokenness' columns indicating broken links.	required
`threshold`	`float`	Proportion threshold for flagging samples. Centers with more than this proportion of broken neighbors are included in the output.	`0.2`

Returns:

Type	Description
`dict`	Dictionary mapping center indices (int) to lists of their neighbor indices for samples exceeding the brokenness threshold.
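
The thresholding step amounts to a per-center proportion check. The sketch below shows the idea without pandas; the tuple layout `(center, neighbor, is_broken)` and the name `flag_centers` are assumptions for illustration and do not match the DataFrame inputs of `threshold_links`.

```python
from collections import defaultdict


def flag_centers(links, threshold=0.2):
    """Return {center: [neighbors]} for centers whose broken share exceeds threshold.

    links is an iterable of (center, neighbor, is_broken) tuples
    (a hypothetical layout chosen for this example).
    """
    neighbors = defaultdict(list)
    broken = defaultdict(int)
    for center, neighbor, is_broken in links:
        neighbors[center].append(int(neighbor))
        broken[center] += bool(is_broken)

    # Keep a center only when its proportion of broken links is above threshold.
    return {
        center: nbrs
        for center, nbrs in neighbors.items()
        if broken[center] / len(nbrs) > threshold
    }


links = [(0, 1, True), (0, 2, False), (1, 0, False), (1, 2, False)]
flagged = flag_centers(links, threshold=0.2)
```

Here center 0 has half of its links broken and is kept with all of its neighbors, while center 1 has none and is dropped.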

Source code in distortions/geometry/neighborhoods.py

def threshold_links(dists, brokenness, threshold=0.2):
    """
    Flag samples with high proportions of broken neighborhood links.

    This function identifies samples where the proportion of broken neighborhood
    links exceeds a specified threshold, indicating problematic embedding regions.

    Parameters
    ----------
    dists : pd.DataFrame
        DataFrame containing distance information with 'center' and 'neighbor' columns.
    brokenness : pd.DataFrame
        DataFrame with 'center' and 'brokenness' columns indicating broken links.
    threshold : float, default=0.2
        Proportion threshold for flagging samples. Centers with more than this
        proportion of broken neighbors are included in the output.

    Returns
    -------
    dict
        Dictionary mapping center indices (int) to lists of their neighbor indices
        for samples exceeding the brokenness threshold.
    """
    brokenness = brokenness.reset_index()
    centers = brokenness.center.unique()
    summary_dict = {}

    for center in centers:
        # Proportion of this center's links that were flagged as broken.
        subset = brokenness[brokenness["center"] == center]
        if np.mean(subset["brokenness"]) > threshold:
            summary_dict[center] = [int(z) for z in dists[dists.center == center].neighbor.values]
    return summary_dict

`distortions.visualization`

`dplot`

Bases: AnyWidget

Interactive Distortion Plot Widget

This class provides an interactive widget for visualizing distortion metrics computed on datasets, with a ggplot2-like syntax for adding graphical marks and overlaying distortion criteria. It is designed for use in Jupyter environments and leverages the anywidget and traitlets libraries for interactivity. You can pause mouseover interactivity by holding down the control key.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	The input dataset to visualize. Must be convertible to a list of records.	required
`*args`	`tuple`	Additional positional arguments passed to the parent AnyWidget.	`()`
`**kwargs`	`dict`	Additional keyword arguments passed to the parent AnyWidget and used as visualization options.	`{}`

Methods:

Name	Description
`mapping`	Specify the mapping from data columns to visual properties.
`geom_ellipse`	Add an ellipse layer to the plot.
`geom_hair`	Add a hair (small oriented lines) layer to the plot.
`labs`	Add labels to the plot.
`geom_edge_link`	Add edge link geometry to the plot.
`inter_edge_link`	Add interactive edge link geometry to the plot.
`inter_isometry`	Add interactive isometry overlays to the plot.
`scale_color`	Add a color scale to the plot.
`scale_size`	Add a size scale to the plot.
`inter_boxplot`	Add an interactive boxplot layer for distortion metrics, using provided distance summaries and outlier information.
`save`	Save the current view to SVG.

Examples:

>>> import pandas as pd
>>> df = pd.DataFrame({...})
>>> dplot(df).mapping(x='embedding_1', y='embedding_2').geom_ellipse()
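
Chaining works because each layer method records a `{"type", "options"}` entry and returns the widget itself. The stand-in class below mimics only that bookkeeping so it can run without a notebook; `LayerChain` and the `opacity` option are hypothetical and are not part of the library.

```python
class LayerChain:
    """Hypothetical stand-in for dplot that only records layers."""

    def __init__(self):
        self.layers = []

    def _add(self, kind, **options):
        # Build a new list rather than appending in place, mirroring the
        # `self.layers = self.layers + [...]` pattern in the source below.
        self.layers = self.layers + [{"type": kind, "options": options}]
        return self

    def geom_ellipse(self, **kwargs):
        return self._add("geom_ellipse", **kwargs)

    def labs(self, **kwargs):
        return self._add("labs", **kwargs)


chain = LayerChain().geom_ellipse(opacity=0.5).labs(x="UMAP 1")
```

Layers are stored in call order, so `chain.layers` lists the ellipse layer first and the labels second.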

Source code in distortions/visualization/interactive.py

class dplot(anywidget.AnyWidget):
    """
    Interactive Distortion Plot Widget

    This class provides an interactive widget for visualizing distortion metrics
    computed on datasets, with a ggplot2-like syntax for adding graphical marks
    and overlaying distortion criteria. It is designed for use in Jupyter
    environments and leverages the anywidget and traitlets libraries for
    interactivity. You can pause mouseover interactivity by holding down the
    control key.

    Parameters
    ----------
    df : pandas.DataFrame
        The input dataset to visualize. Must be convertible to a list of records.
    *args : tuple
        Additional positional arguments passed to the parent AnyWidget.
    **kwargs : dict
        Additional keyword arguments passed to the parent AnyWidget and used as
        visualization options.

    Methods
    -------
    mapping(**kwargs)
        Specify the mapping from data columns to visual properties.
    geom_ellipse(**kwargs)
        Add an ellipse layer to the plot.
    geom_hair(**kwargs)
        Add a hair (small oriented lines) layer to the plot.
    labs(**kwargs)
        Add labels to the plot.
    geom_edge_link(**kwargs)
        Add edge link geometry to the plot.
    inter_edge_link(**kwargs)
        Add interactive edge link geometry to the plot.
    inter_isometry(**kwargs)
        Add interactive isometry overlays to the plot.
    scale_color(**kwargs)
        Add a color scale to the plot.
    scale_size(**kwargs)
        Add a size scale to the plot.
    inter_boxplot(dists, **kwargs)
        Add an interactive boxplot layer for distortion metrics, using provided
        distance summaries and outlier information.
    save(filename)
        Save the current view to SVG.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({...})
    >>> dplot(df).mapping(x='embedding_1', y='embedding_2').geom_ellipse()
    """
    widget_dir = Path(__file__).parent / "widget"
    _esm = widget_dir / "render.js"
    _mapping = traitlets.Dict().tag(sync=True)
    dataset = traitlets.List().tag(sync=True)
    corrected = traitlets.List().tag(sync=True)
    layers = traitlets.List().tag(sync=True)
    neighbors = traitlets.List().tag(sync=True)
    distance_summaries = traitlets.List().tag(sync=True)
    outliers = traitlets.List().tag(sync=True)
    options = traitlets.Dict().tag(sync=True)
    elem_svg = traitlets.Unicode().tag(sync=True)

    def __init__(self, df, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.dataset = df.to_dict("records")
        self.options = kwargs

    def mapping(self, **kwargs):
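        """
        Specify the mapping from data columns to visual properties.

        Unless overridden, ``angle``, ``a``, and ``b`` map to the columns
        'angle', 's1', and 's0'.
        """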
        kwargs = {"angle": "angle", "a": "s1", "b": "s0", **kwargs}
        self._mapping = kwargs
        return self

    def geom_ellipse(self, **kwargs):
        self.layers = self.layers + [{"type": "geom_ellipse", "options": kwargs}]
        return self

    def geom_hair(self, **kwargs):
        self.layers = self.layers + [{'type': 'geom_hair', 'options': kwargs}]
        return self

    def labs(self, **kwargs):
        self.layers = self.layers + [{"type": "labs", "options": kwargs}]
        return self

    def geom_edge_link(self, **kwargs):
        self.layers = self.layers + [{"type": "geom_edge_link", "options": kwargs}]
        return self

    def inter_edge_link(self, **kwargs):
        self.layers = self.layers + [{"type": "inter_edge_link", "options": kwargs}]
        return self

    def inter_isometry(self, **kwargs):
        self.layers = self.layers + [{"type": "inter_isometry", "options": kwargs}]
        return self

    def scale_color(self, **kwargs):
        self.layers = self.layers + [{"type": "scale_color", "options": kwargs}]
        return self

    def scale_size(self, **kwargs):
        self.layers = self.layers + [{"type": "scale_size", "options": kwargs}]
        return self

    def inter_boxplot(self, dists, **kwargs):
        summaries, outliers = boxplot_data(dists["true"], dists["embedding"], **kwargs)
        outliers["center"] = dists.center.values[outliers["index"].values]
        outliers["neighbor"] = dists.neighbor.values[outliers["index"].values]

        # pass the related data to the visualization
        self.layers = self.layers + [{"type": "inter_boxplot", "options": kwargs}]
        self.distance_summaries = summaries.to_dict("records")
        self.outliers = outliers.to_dict("records")
        return self

    def save(self, filename="plot.svg"):
        self.send({"type": "save"})
        # The context manager closes the file; no explicit close() is needed.
        with open(filename, "w") as f:
            f.write(self.elem_svg)

    def correct(self):
        return pd.DataFrame(self.corrected)

`mapping(**kwargs)`

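Specify the mapping from data columns to visual properties. Unless overridden, `angle`, `a`, and `b` map to the columns 'angle', 's1', and 's0'.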

Source code in distortions/visualization/interactive.py

def mapping(self, **kwargs):
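    """
    Specify the mapping from data columns to visual properties.

    Unless overridden, ``angle``, ``a``, and ``b`` map to the columns
    'angle', 's1', and 's0'.
    """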
    kwargs = {"angle": "angle", "a": "s1", "b": "s0", **kwargs}
    self._mapping = kwargs
    return self
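
The default handling in `mapping` is a plain dictionary merge in which later keys win. A minimal standalone sketch (the function name `merge_mapping` is hypothetical; the default keys are taken from the source above):

```python
def merge_mapping(**kwargs):
    """Merge user-supplied aesthetics over the defaults used by mapping()."""
    defaults = {"angle": "angle", "a": "s1", "b": "s0"}
    # Keys in kwargs override the defaults because they are unpacked last.
    return {**defaults, **kwargs}


merged = merge_mapping(x="embedding_1", a="major")
```

Passing `a="major"` replaces the default `'s1'`, while `angle` and `b` keep their defaults and `x` is added.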

`scanpy_umap(adata, max_cells=200, n_neighbors=10, n_pcs=40)`

Runs UMAP visualization on an AnnData object with basic preprocessing.

This wrapper function filters genes by minimum count, applies log transformation, selects highly variable genes, computes neighbors in PCA space, and runs UMAP.

Parameters:

Name	Type	Description	Default
`adata`	`AnnData`	AnnData experiment object containing the data to filter, transform, and apply UMAP to.	required
`max_cells`	`int`	Maximum number of cells to keep; the first `max_cells` cells of `adata` are used.	`200`
`n_neighbors`	`int`	Number of neighbors to use for constructing the neighborhood graph.	`10`
`n_pcs`	`int`	Number of principal components to use for neighborhood graph construction.	`40`

Returns:

Name	Type	Description
`adata`	`AnnData`	The AnnData object after preprocessing and UMAP computation.

Notes

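The function slices the input to its first `max_cells` cells before preprocessing, so use the returned object rather than relying on the input being updated in place.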

Examples:

>>> import scanpy as sc
>>> from distortions.visualization import scanpy_umap
>>> adata = sc.datasets.pbmc3k()
>>> adata_umap = scanpy_umap(adata, max_cells=100, n_neighbors=15, n_pcs=30)
>>> sc.pl.umap(adata_umap)

Source code in distortions/visualization/umap.py

def scanpy_umap(adata, max_cells=200, n_neighbors=10, n_pcs=40):
    """
    Runs UMAP visualization on an AnnData object with basic preprocessing.

    This wrapper function filters genes by minimum count, applies log
    transformation, selects highly variable genes, computes neighbors in PCA
    space, and runs UMAP.

    Parameters
    ----------
    adata : anndata.AnnData
        AnnData experiment object containing the data to filter, transform, and
        apply UMAP to.
    max_cells : int, optional (default: 200)
        Maximum number of cells to use for visualization.
    n_neighbors : int, optional (default: 10)
        Number of neighbors to use for constructing the neighborhood graph.
    n_pcs : int, optional (default: 40)
        Number of principal components to use for neighborhood graph construction.

    Returns
    -------
    adata : anndata.AnnData
        The AnnData object after preprocessing and UMAP computation.

    Notes
    -----
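    The function slices the input to its first ``max_cells`` cells before
    preprocessing, so use the returned object rather than relying on the
    input being updated in place.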

    Examples
    --------
    >>> import scanpy as sc
    >>> from distortions.visualization import scanpy_umap
    >>> adata = sc.datasets.pbmc3k()
    >>> adata_umap = scanpy_umap(adata, max_cells=100, n_neighbors=15, n_pcs=30)
    >>> sc.pl.umap(adata_umap)
    """
    adata = adata[:max_cells, :]

    # Preprocess the dataset
    sc.pp.filter_genes(adata, min_counts=1)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, min_mean=0.5, min_disp=0.5)
    adata = adata[:, adata.var.highly_variable]

    # Run UMAP
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    sc.tl.umap(adata)
    return adata