harpy.tb.kmeans

harpy.tb.kmeans(sdata, labels_name, table_name, output_table_name, calculate_umap=True, rank_genes=True, n_neighbors=35, n_pcs=17, n_clusters=5, key_added='kmeans', index_names_var=None, index_positions_var=None, random_state=100, overwrite=False, **kwargs)

Applies KMeans clustering to the table element table_name of a SpatialData object, with optional UMAP calculation and gene ranking.

This function runs the KMeans clustering algorithm (via KMeans) on a table of a SpatialData object. It optionally computes a UMAP (Uniform Manifold Approximation and Projection) embedding for dimensionality reduction and ranks genes based on their contributions to the clustering. The clustering results, along with the optional UMAP and gene ranking, are added to sdata.tables[output_table_name] for downstream analysis.
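To make the roles of n_clusters and random_state concrete, here is a toy, pure-Python sketch of the basic k-means loop (assign each point to its nearest centre, then move each centre to the mean of its members). The name toy_kmeans is hypothetical and this is not the implementation harpy calls; the KMeans class referenced above (presumably scikit-learn's) uses a more careful initialisation and convergence checks.

```python
import random


def toy_kmeans(points, n_clusters, random_state=100, n_iter=20):
    """Toy k-means on a list of equal-length numeric tuples (illustration only)."""
    rng = random.Random(random_state)  # fixed seed -> reproducible initial centres
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point gets the index of its nearest centre.
        labels = [
            min(
                range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            for p in points
        ]
        # Update step: move each centre to the mean of its assigned points.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster ends up empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = toy_kmeans(points, n_clusters=2, random_state=100)
```

Calling the function twice with the same random_state yields the same labels, which is the kind of reproducibility the random_state parameter is meant to provide.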

Parameters:

sdata (SpatialData) – The input SpatialData object.
labels_name (str | HashableList[str] | None) – The labels element(s) of sdata used to select the cells via the region key in sdata.tables[table_name].obs. If a list of labels elements is provided, the cells linked to them are clustered together (e.g. multiple samples). Note that if output_table_name is equal to table_name and overwrite is True, cells in sdata.tables[table_name] linked to other labels elements (via the region key) will be removed from sdata.tables[table_name].
table_name (str) – The table element in sdata on which to perform clustering.
output_table_name (str) – The table element in sdata to which the table with the clustering results will be written.
calculate_umap (bool (default: True)) – If True, calculates a UMAP via umap() for visualization of computed clusters.
rank_genes (bool (default: True)) – If True, ranks genes based on their contributions to the clusters via rank_genes_groups(), with default parameters. Note that rank_genes_groups() will be run on the .raw attribute of the AnnData table, if .raw is not None.
n_neighbors (int (default: 35)) – The number of neighbors to consider when calculating neighbors via neighbors(). Ignored if calculate_umap is False.
n_pcs (int (default: 17)) – The number of principal components to use when calculating neighbors via neighbors(). Ignored if calculate_umap is False.
n_clusters (int (default: 5)) – The number of clusters to form.
key_added (str (default: 'kmeans')) – The key under which the clustering results are added to the SpatialData object (in sdata.tables[output_table_name].obs).
index_names_var (Iterable[str] | None (default: None)) – List of index names in sdata.tables[table_name].var to subset on. If None, index_positions_var is used instead.
index_positions_var (Iterable[int] | None (default: None)) – List of integer positions in sdata.tables[table_name].var to subset on. Only used if index_names_var is None.
random_state (int (default: 100)) – A random state for reproducibility of the clustering.
overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.
**kwargs – Additional keyword arguments passed to the KMeans algorithm (KMeans).
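The precedence between index_names_var and index_positions_var described above can be sketched as follows. The helper resolve_var_subset is hypothetical and not part of harpy; returning all variables when both arguments are None, and returning names in the order given, are assumptions made for illustration.

```python
def resolve_var_subset(var_names, index_names_var=None, index_positions_var=None):
    """Return the variable names that would be kept (illustrative sketch only)."""
    var_names = list(var_names)
    if index_names_var is not None:
        # Names take precedence; positions are ignored in this case.
        return list(index_names_var)
    if index_positions_var is not None:
        # Fall back to integer positions into the .var index.
        return [var_names[i] for i in index_positions_var]
    # Assumption: with neither argument set, every variable is used.
    return var_names


markers = ["CD3", "CD4", "CD8"]
by_name = resolve_var_subset(markers, index_names_var=["CD8", "CD3"], index_positions_var=[1])
by_position = resolve_var_subset(markers, index_positions_var=[0, 2])
everything = resolve_var_subset(markers)
```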

Returns:

The input sdata with the clustering results added.
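A minimal usage sketch, assuming harpy is installed and that sdata already contains a preprocessed table. The element names "segmentation_mask", "table_preprocessed" and "table_kmeans" are hypothetical placeholders, and the wrapper cluster_cells exists only to keep the example self-contained.

```python
def cluster_cells(sdata):
    """Run KMeans clustering on a preprocessed table and return the updated sdata."""
    # Imported lazily so the sketch can be defined without harpy installed.
    import harpy

    return harpy.tb.kmeans(
        sdata,
        labels_name="segmentation_mask",   # hypothetical labels element
        table_name="table_preprocessed",   # hypothetical input table
        output_table_name="table_kmeans",  # hypothetical output table
        calculate_umap=True,
        rank_genes=True,
        n_clusters=5,
        key_added="kmeans",
        random_state=100,
        overwrite=True,
    )
```

After the call, the cluster assignments would be expected in the obs of sdata.tables["table_kmeans"] under the key "kmeans".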

Notes

The function adds a table element containing the clustering labels and, optionally, UMAP coordinates and gene rankings, to facilitate downstream analysis and visualization.
Gene ranking based on cluster contributions is intended for identifying marker genes that characterize each cluster.

Warning

The function is intended for use with spatial omics data. Input data should be appropriately preprocessed (e.g. via preprocess_transcriptomics() or preprocess_proteomics()) to ensure meaningful clustering results.
The rank_genes functionality is marked for relocation to enhance modularity and clarity of the codebase.
