harpy.tb.preprocess_transcriptomics


harpy.tb.preprocess_transcriptomics(sdata, labels_name, table_name, output_table_name, percent_top=(2, 5), min_counts=10, min_cells=5, size_norm=True, library_norm=False, highly_variable_genes=False, highly_variable_genes_kwargs=mappingproxy({}), max_value_scale=10, n_comps=50, update_shapes_elements=True, instance_size_key='shapeSize', raw_counts_key='raw_counts', overwrite=False)

Preprocess a table (AnnData) attribute of a SpatialData object for transcriptomics data.

Performs filtering (via filter_cells() and filter_genes()), optional normalization (by instance size or via normalize_total()), log transformation (log1p()), optional selection of highly variable genes (highly_variable_genes()), scaling (scale()), and PCA calculation (pca()) on the transcriptomics data contained in sdata.tables[table_name]. QC metrics are added to sdata.tables[output_table_name].obs using calculate_qc_metrics().
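The filtering and log-transformation steps described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes min_counts is a threshold on the total counts of a cell and min_cells a threshold on the number of cells in which a gene is detected (as in scanpy's filter_cells() and filter_genes()), and it skips normalization, scaling and PCA. The helper name filter_and_log is hypothetical.

```python
import math


def filter_and_log(counts, min_counts=10, min_cells=5):
    """Filter a cells-by-genes count matrix, then apply log1p.

    counts: list of rows, one row per cell, one column per gene.
    """
    # Keep cells whose total count reaches min_counts.
    kept_cells = [row for row in counts if sum(row) >= min_counts]
    n_genes = len(counts[0]) if counts else 0
    # Keep genes detected (count > 0) in at least min_cells of the kept cells.
    kept_genes = [
        g
        for g in range(n_genes)
        if sum(1 for row in kept_cells if row[g] > 0) >= min_cells
    ]
    # log1p(x) = log(1 + x), so zero counts stay zero.
    return [[math.log1p(row[g]) for g in kept_genes] for row in kept_cells]


counts = [
    [5, 0, 7],  # total 12: kept
    [1, 0, 2],  # total 3: dropped
    [6, 1, 4],  # total 11: kept
]
# Small thresholds so the toy matrix is not filtered away entirely.
result = filter_and_log(counts, min_counts=10, min_cells=2)
```

With these toy thresholds, the second cell is removed for having too few counts, and the middle gene is removed because it is detected in only one of the remaining cells.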

Parameters:

sdata (SpatialData) – The input SpatialData object.
labels_name (str | Iterable[str]) – The labels element(s) of sdata used to select the cells via the region key in sdata.tables[table_name].obs. Note that if output_table_name is equal to table_name and overwrite is True, cells in sdata.tables[table_name] that are linked to other labels elements (via the region key) will be removed from sdata.tables[table_name] (and from the backing zarr store, if sdata is backed).
table_name (str) – The table element in sdata on which to perform preprocessing.
output_table_name (str) – The output table element in sdata to which the preprocessed table will be written.
percent_top (tuple[int, ...] (default: (2, 5))) – List of ranks (where genes are ranked by expression) at which the cumulative proportion of expression will be reported as a percentage. Passed to calculate_qc_metrics().
min_counts (int (default: 10)) – Minimum number of counts a cell should contain to be kept (passed to filter_cells()).
min_cells (int (default: 5)) – Minimum number of cells a gene should be in to be kept (passed to filter_genes()).
size_norm (bool (default: True)) – If True, normalization is based on the size of the nucleus/cell. Resulting values are multiplied by 100 after normalization.
library_norm (bool (default: False)) – If True, normalize_total() is used for normalization.
highly_variable_genes (bool (default: False)) – If True, will only retain highly variable genes, as calculated by highly_variable_genes().
highly_variable_genes_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Keyword arguments passed to highly_variable_genes(). Ignored if highly_variable_genes is False.
max_value_scale (float | None (default: 10)) – The maximum value to which data will be scaled, when scaling the data to have zero mean and a variance of one, using scale().
n_comps (int (default: 50)) – Number of principal components to calculate.
update_shapes_elements (bool (default: True)) – Whether to filter the shapes elements associated with labels_name. If set to True, cells that do not appear in the resulting output_table_name (with the region key equal to labels_name) will be removed from the shapes elements (via the region key) in the sdata object. Filtered shapes will be added to sdata with the prefix ‘filtered_low_counts’. This parameter is deprecated and will be removed in a future version.
instance_size_key (str (default: 'shapeSize')) – The key in the AnnData table .obs that holds the size of the instances. If instance_size_key is not found in .obs it will be calculated from the labels_name. Ignored if size_norm is False.
raw_counts_key (str (default: 'raw_counts')) – Name of the AnnData layer where the non-preprocessed counts will be stored.
overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.
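As a rough illustration of the size_norm option, the sketch below divides the counts of each cell by the size of that cell and multiplies the result by 100, following the parameter description above. It is an assumption about the arithmetic only, not the library's code; the function name size_normalize and the toy values are hypothetical, and sizes stands in for the values stored under instance_size_key in .obs.

```python
def size_normalize(counts, sizes, factor=100):
    """Divide each cell's counts by that cell's size, then multiply by factor."""
    if len(counts) != len(sizes):
        raise ValueError("counts and sizes must have one entry per cell")
    return [
        [value * factor / size for value in row]
        for row, size in zip(counts, sizes)
    ]


counts = [[4, 0, 6], [1, 3, 0]]  # two cells, three genes
sizes = [200, 50]                # hypothetical instance sizes, one per cell
normalized = size_normalize(counts, sizes)
```

Here the larger cell (size 200) and the smaller cell (size 50) end up on a comparable scale: a count of 4 in the first cell and a count of 1 in the second both normalize to 2.0.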

Return type:

SpatialData

Returns:

The sdata containing the preprocessed AnnData object as an attribute (sdata.tables[output_table_name]).
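A call might look like the sketch below. The keyword arguments are taken from the signature on this page; the element names ("segmentation_mask", "table_transcriptomics", "table_transcriptomics_preprocessed") and the wrapper function preprocess_example are hypothetical, and it is assumed that the tb submodule is reachable after a plain import harpy. The import sits inside the function so that the sketch can be defined without harpy installed.

```python
def preprocess_example(sdata):
    """Run preprocessing on an existing SpatialData object and return the new table."""
    import harpy  # assumed import name, matching the module path on this page

    sdata = harpy.tb.preprocess_transcriptomics(
        sdata,
        labels_name="segmentation_mask",  # hypothetical labels element
        table_name="table_transcriptomics",  # hypothetical input table
        output_table_name="table_transcriptomics_preprocessed",  # hypothetical
        min_counts=10,
        min_cells=5,
        size_norm=True,
        n_comps=50,
        overwrite=True,
    )
    # The preprocessed AnnData object lives under the output table name.
    return sdata.tables["table_transcriptomics_preprocessed"]
```

Writing to a separate output_table_name, as here, leaves the input table untouched; reusing table_name together with overwrite=True replaces it, with the side effects described under labels_name.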

Raises:

ValueError – If sdata does not have a labels attribute.
ValueError – If sdata does not have a tables attribute.
ValueError – If labels_name, or one of the elements of labels_name, is not a labels element in sdata.
ValueError – If table_name is not a table element in sdata.

Warning

If max_value_scale is set too low, it may overly constrain the variability of the data, potentially impacting downstream analyses.
If the dimensionality of sdata.tables[table_name] is smaller than the desired number of principal components, n_comps is set to the minimum dimensionality, and a message is printed.
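Both warnings can be made concrete with a small sketch. It is a simplified stand-in rather than what scale() or pca() actually do: it assumes a population standard deviation, caps only the upper tail, and reads "minimum dimensionality" as the smaller of the number of cells and the number of genes. The names scale_and_cap and effective_n_comps are hypothetical.

```python
import statistics


def scale_and_cap(values, max_value=None):
    """Scale one gene to zero mean and unit variance, then cap at max_value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant gene carries no variation; return zeros.
        return [0.0 for _ in values]
    scaled = [(v - mean) / std for v in values]
    if max_value is None:
        return scaled
    return [min(s, max_value) for s in scaled]


def effective_n_comps(n_obs, n_vars, n_comps=50):
    """Reduce n_comps when the table has fewer cells or genes than requested."""
    return min(n_comps, n_obs, n_vars)


expression = [0, 0, 0, 0, 0, 0, 0, 0, 0, 100]  # one strongly expressing cell
default_cap = scale_and_cap(expression, max_value=10)
low_cap = scale_and_cap(expression, max_value=1)
small_table_comps = effective_n_comps(n_obs=30, n_vars=2000, n_comps=50)
```

With the default cap of 10 the outlier keeps its scaled value of about 3, whereas a cap of 1 flattens it to 1 and removes most of the contrast with the other cells. For a table with 30 cells and 2000 genes, the sketch reduces a requested 50 components to 30.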
