harpy.im.flowsom

harpy.im.flowsom(sdata, image_name, output_cluster_labels_name, output_metacluster_labels_name, channels=None, fraction=0.1, n_clusters=5, random_state=100, chunks=None, scale_factors=None, client=None, persist_intermediate=True, write_intermediate=True, instance_key='cell_ID', region_key='fov_labels', spatial_key='spatial', overwrite=False, **kwargs)

Applies FlowSOM clustering to image element(s) of a SpatialData object.

This function runs the FlowSOM clustering algorithm (via fs.FlowSOM) on image data held in a SpatialData object. The predicted clusters and metaclusters are added as labels elements at sdata.labels[output_cluster_labels_name] and sdata.labels[output_metacluster_labels_name], respectively.

Parameters:
  • sdata (SpatialData) – The input SpatialData object.

  • image_name (str | Iterable[str]) – The image element(s) of sdata on which FlowSOM is run. It is recommended to preprocess the data with harpy.im.pixel_clustering_preprocess().

  • output_cluster_labels_name (str | Iterable[str]) – The output labels element in sdata to which the predicted FlowSOM SOM clusters are saved.

  • output_metacluster_labels_name (str | Iterable[str]) – The output labels element in sdata to which the predicted FlowSOM metaclusters are saved.

  • channels (int | str | Iterable[int] | Iterable[str] | None (default: None)) – Specifies the channels to be included in the pixel clustering.

  • fraction (float | None (default: 0.1)) – Fraction of the data to sample for training FlowSOM. Inference will be done on all pixels in image_name.

  • n_clusters (int (default: 5)) – The number of metaclusters to form.

  • random_state (int (default: 100)) – A random state for reproducibility of the clustering and sampling.

  • chunks (str | int | tuple[int, ...] | None (default: None)) – Chunk sizes used for the FlowSOM inference step on image_name. If provided as a tuple, it should contain chunk sizes for c, (z), y, x.

  • scale_factors (Sequence[dict[str, int] | int] | None (default: None)) – Scale factors to apply for multiscale.

  • client (Client | None (default: None)) – A Dask Client instance. If specified, during inference, the trained fs.models.BaseFlowSOMEstimator model will be scattered (client.scatter(...)). This reduces the size of the task graph and can improve performance by minimizing data transfer overhead during computation. If not specified, Dask will use the default scheduler as configured on your system (e.g., single-threaded, multithreaded, or a global client if one is running).

  • persist_intermediate (bool (default: True)) – If True, intermediate computations are persisted in memory. If image_name, or one of the elements in image_name, is large, this can increase RAM usage. Set to False to write to an intermediate Zarr store instead, which reduces RAM usage but slightly increases computation time. We advise setting persist_intermediate to True, as it only persists an array of dimension (2,z,y,x) and dtype numpy.uint8. Ignored if sdata is not backed by a Zarr store.

  • write_intermediate (bool (default: True)) – If True, an intermediate Zarr store is used during sampling from image_name for FlowSOM training. Enable this option to reduce RAM usage, especially if image_name or any of its components is large. Ignored if sdata is not backed by a Zarr store.

  • instance_key (str (default: 'cell_ID')) – Instance key. The name of the column in the .obs attribute of the AnnData table at slot “cell_data” of the mudata.MuData object (which is an attribute of the returned flowsom.FlowSOM object) that will hold the instance ids.

  • region_key (str (default: 'fov_labels')) – Region key. The name of the column in the .obs attribute of the AnnData table at slot “cell_data” of the mudata.MuData object (which is an attribute of the returned flowsom.FlowSOM object) that will hold the name of the image element that is annotated by the table.

  • spatial_key (str (default: 'spatial')) – Spatial key. The name of the slot in the .obsm attribute of the AnnData table at slot “cell_data” of the mudata.MuData object (which is an attribute of the returned flowsom.FlowSOM object) that will hold the (z),y,x coordinate of the pixel.

  • overwrite (bool (default: False)) – If True, overwrites output_cluster_labels_name and/or output_metacluster_labels_name if they already exist in sdata.

  • **kwargs – Additional keyword arguments passed to fs.FlowSOM.
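
The fraction and persist_intermediate descriptions above imply some simple sizing arithmetic, sketched below. The helper name estimate_flowsom_sizes is hypothetical and not part of harpy. It assumes that fraction is applied to the number of pixels (z*y*x) and that the sampled count is roughly fraction times that total; the persisted array size follows from the stated shape (2,z,y,x) and dtype numpy.uint8 (1 byte per element).

```python
def estimate_flowsom_sizes(shape_czyx, fraction=0.1):
    """Rough sizing helper (hypothetical, not part of harpy).

    shape_czyx: (c, z, y, x) shape of the image element.
    Returns the approximate number of pixels sampled for training and the
    size in bytes of the persisted (2, z, y, x) uint8 array.
    """
    _c, z, y, x = shape_czyx
    n_pixels = z * y * x
    # Assumption: the training sample is roughly `fraction` of all pixels.
    n_training_pixels = round(fraction * n_pixels)
    # A uint8 array of shape (2, z, y, x) takes one byte per element.
    persisted_bytes = 2 * n_pixels
    return n_training_pixels, persisted_bytes


n_train, n_bytes = estimate_flowsom_sizes((20, 1, 2000, 2000), fraction=0.1)
print(f"~{n_train} training pixels, {n_bytes / 1e6:.0f} MB persisted")
```

For much larger images the same arithmetic indicates when persist_intermediate=False or a smaller fraction may be worth considering.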

Return type:

tuple[SpatialData, FlowSOM, Series]

Returns:

A tuple containing:

  • The input sdata with the clustering results added.

  • A FlowSOM object containing a MuData object and a trained fs.models.FlowSOMEstimator. The MuData object only contains the fraction of the data (set via the fraction parameter) that was sampled from image_name and used to train the FlowSOM model.

  • A pandas Series object containing a mapping between the clusters and the metaclusters.
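
A minimal usage sketch of the call and its three return values, based only on the signature documented above. The wrapper name run_pixel_flowsom, the default element name "raw_image_processed", and the "_clusters" / "_metaclusters" suffixes are illustrative assumptions; sdata is assumed to be a SpatialData object that already contains a preprocessed image element.

```python
def run_pixel_flowsom(sdata, image_name="raw_image_processed"):
    """Run FlowSOM pixel clustering on one image element (illustrative wrapper)."""
    # Imported lazily so the sketch can be defined without harpy installed.
    import harpy as hp

    sdata, fsom, mapping = hp.im.flowsom(
        sdata,
        image_name=image_name,
        # Hypothetical output element names, chosen for this example only.
        output_cluster_labels_name=f"{image_name}_clusters",
        output_metacluster_labels_name=f"{image_name}_metaclusters",
        fraction=0.1,      # train on a sample of the data
        n_clusters=5,      # number of metaclusters
        random_state=100,  # reproducible sampling and clustering
        overwrite=True,    # replace existing output elements, if any
    )
    # sdata: input object with the two labels elements added
    # fsom: FlowSOM object holding the sampled training data and the model
    # mapping: pandas Series relating clusters to metaclusters
    return sdata, fsom, mapping
```

After the call, the results would be available at sdata.labels[...] under the two output names passed in.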

See also

harpy.im.pixel_clustering_preprocess

Preprocess image elements before applying FlowSOM clustering.

Warning

  • The function is intended for use with spatial proteomics data. Input data should be appropriately preprocessed (e.g. via harpy.im.pixel_clustering_preprocess()) to ensure meaningful clustering results.

  • The cluster and metacluster IDs found in output_cluster_labels_name and output_metacluster_labels_name count from 1, while they count from 0 in the FlowSOM object.
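
The off-by-one between the labels elements and the FlowSOM object can be handled with a small conversion, sketched below. Both helper names are hypothetical and not part of harpy; the sketch assumes the offset is exactly 1 for both cluster and metacluster IDs, as the warning above states, and rejects values outside the documented ranges.

```python
def label_id_to_flowsom_id(label_id: int) -> int:
    """Convert a 1-based ID from a labels element to the 0-based FlowSOM ID."""
    if label_id < 1:
        raise ValueError("IDs in the labels elements count from 1")
    return label_id - 1


def flowsom_id_to_label_id(flowsom_id: int) -> int:
    """Convert a 0-based ID from the FlowSOM object to the 1-based labels ID."""
    if flowsom_id < 0:
        raise ValueError("IDs in the FlowSOM object count from 0")
    return flowsom_id + 1


# With n_clusters=5, metacluster IDs run 1..5 in the labels element
# and 0..4 in the FlowSOM object.
print([label_id_to_flowsom_id(i) for i in range(1, 6)])
```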