harpy.tb.preprocess_transcriptomics

harpy.tb.preprocess_transcriptomics(sdata, labels_name, table_name, output_table_name, percent_top=(2, 5), min_counts=10, min_cells=5, size_norm=True, library_norm=False, highly_variable_genes=False, highly_variable_genes_kwargs=mappingproxy({}), max_value_scale=10, n_comps=50, update_shapes_elements=True, instance_size_key='shapeSize', raw_counts_key='raw_counts', overwrite=False)

Preprocess a table (AnnData) attribute of a SpatialData object for transcriptomics data.

Performs filtering (via filter_cells() and filter_genes()), optional normalization (by instance size or via normalize_total()), log transformation (log1p()), highly variable gene selection (highly_variable_genes()), scaling (scale()), and PCA (pca()) on the transcriptomics data in sdata.tables[table_name]. QC metrics are added to sdata.tables[output_table_name].obs using calculate_qc_metrics().
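The filtering step can be illustrated with a minimal pure-Python sketch on a tiny cells-by-genes count matrix. It is not the library's implementation: the helper name filter_counts is hypothetical, and filtering cells before genes is an assumption based on the order in which the two functions are listed above.

```python
def filter_counts(matrix, min_counts=10, min_cells=5):
    """Return (kept_cell_indices, kept_gene_indices) for a cells x genes matrix.

    Hypothetical helper for illustration only; assumes cells are filtered
    first and genes are then filtered on the remaining cells.
    """
    # Keep cells whose total count reaches min_counts.
    kept_cells = [i for i, row in enumerate(matrix) if sum(row) >= min_counts]
    # Keep genes detected (count > 0) in at least min_cells of the kept cells.
    n_genes = len(matrix[0]) if matrix else 0
    kept_genes = [
        j
        for j in range(n_genes)
        if sum(1 for i in kept_cells if matrix[i][j] > 0) >= min_cells
    ]
    return kept_cells, kept_genes


counts = [
    [8, 3, 0],  # total 11: kept with min_counts=10
    [1, 0, 0],  # total 1: dropped
    [5, 6, 1],  # total 12: kept
]
cells, genes = filter_counts(counts, min_counts=10, min_cells=2)
```

With these toy values, the second cell is dropped for having too few counts, and the third gene is dropped because only one remaining cell expresses it.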

Parameters:
  • sdata (SpatialData) – The input SpatialData object.

  • labels_name (str | Iterable[str]) – The labels element(s) of sdata used to select the cells via the region key in sdata.tables[table_name].obs. Note that if output_table_name is equal to table_name and overwrite is True, cells in sdata.tables[table_name] that are linked (via the region key) to labels elements other than labels_name will be removed from sdata.tables[table_name] (and from the backing zarr store, if sdata is backed).

  • table_name (str) – The table element in sdata on which to perform preprocessing.

  • output_table_name (str) – The output table element in sdata to which the preprocessed table will be written.

  • percent_top (tuple[int, ...] (default: (2, 5))) – List of ranks (where genes are ranked by expression) at which the cumulative proportion of expression will be reported as a percentage. Passed to calculate_qc_metrics().

  • min_counts (int (default: 10)) – Minimum number of counts a cell should contain to be kept (passed to filter_cells()).

  • min_cells (int (default: 5)) – Minimum number of cells a gene should be in to be kept (passed to filter_genes()).

  • size_norm (bool (default: True)) – If True, normalization is based on the size of the nucleus/cell. Resulting values are multiplied by 100 after normalization.

  • library_norm (bool (default: False)) – If True, normalize_total() is used for normalization.

  • highly_variable_genes (bool (default: False)) – If True, will only retain highly variable genes, as calculated by highly_variable_genes().

  • highly_variable_genes_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Keyword arguments passed to highly_variable_genes(). Ignored if highly_variable_genes is False.

  • max_value_scale (float | None (default: 10)) – The maximum value to which data will be scaled, when scaling the data to have zero mean and a variance of one, using scale().

  • n_comps (int (default: 50)) – Number of principal components to calculate.

  • update_shapes_elements (bool (default: True)) – Whether to filter the shapes elements associated with labels_name. If set to True, cells that do not appear in resulting output_table_name (with the region key equal to labels_name) will be removed from the shapes elements (via region key) in the sdata object. Filtered shapes will be added to sdata with prefix ‘filtered_low_counts’. This parameter is deprecated, and will be removed in a future version.

  • instance_size_key (str (default: 'shapeSize')) – The key in the AnnData table .obs that holds the size of the instances. If instance_size_key is not found in .obs it will be calculated from the labels_name. Ignored if size_norm is False.

  • raw_counts_key (str (default: 'raw_counts')) – Name of the AnnData layer where the non-preprocessed counts will be stored.

  • overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.
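The size_norm option can be sketched for a single cell as follows. This is a minimal illustration, assuming that "based on the size" means dividing each count by the instance size (the value stored under instance_size_key) before multiplying by 100, followed by the log1p transformation mentioned in the description. The helper names are hypothetical.

```python
import math


def size_normalize(counts_row, instance_size, factor=100.0):
    """Divide one cell's counts by its size, then multiply by `factor`.

    Hypothetical helper; the division by size is an assumption about what
    size-based normalization means here.
    """
    if instance_size <= 0:
        raise ValueError("instance_size must be positive")
    return [count * factor / instance_size for count in counts_row]


def log1p_row(values):
    """Apply log(1 + x) element-wise, as log1p() does."""
    return [math.log1p(value) for value in values]


row = [4, 0, 10]  # raw counts for one cell
normalized = size_normalize(row, instance_size=200)  # size in pixels, say
transformed = log1p_row(normalized)
```

Here a cell of size 200 with counts 4, 0 and 10 ends up with normalized values 2.0, 0.0 and 5.0 before the log transformation, so larger cells do not dominate simply by capturing more transcripts.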

Return type:

SpatialData

Returns:

The sdata containing the preprocessed AnnData object as an attribute (sdata.tables[output_table_name]).
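A call might look like the sketch below. The element names ("segmentation_mask", "table_transcriptomics", "table_transcriptomics_preprocessed") and both helper functions are hypothetical; only the parameter names and defaults come from the signature above. The harpy import is deferred into the function so the snippet can be read and loaded without a prepared SpatialData object.

```python
def build_preprocess_kwargs(labels_name, table_name, output_table_name):
    """Collect keyword arguments for preprocess_transcriptomics.

    Hypothetical helper; values mirror the documented defaults.
    """
    return {
        "labels_name": labels_name,
        "table_name": table_name,
        "output_table_name": output_table_name,
        "min_counts": 10,
        "min_cells": 5,
        "size_norm": True,
        "n_comps": 50,
    }


def run_preprocessing(sdata, **kwargs):
    """Run the preprocessing on an existing SpatialData object."""
    import harpy as hp  # imported lazily; requires harpy to be installed

    return hp.tb.preprocess_transcriptomics(sdata, **kwargs)


kwargs = build_preprocess_kwargs(
    labels_name="segmentation_mask",  # hypothetical labels element
    table_name="table_transcriptomics",  # hypothetical input table
    output_table_name="table_transcriptomics_preprocessed",
)
```

Given an sdata that contains those elements, sdata = run_preprocessing(sdata, **kwargs) would return the updated object, and the preprocessed table would then be available as sdata.tables["table_transcriptomics_preprocessed"]. Writing to a new output_table_name avoids the need for overwrite=True.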

Raises:
  • ValueError – If sdata does not have a labels attribute.

  • ValueError – If sdata does not have a tables attribute.

  • ValueError – If labels_name, or one of the elements of labels_name, is not a labels element in sdata.

  • ValueError – If table_name is not a table element in sdata.

Warning

  • If max_value_scale is set too low, it may overly constrain the variability of the data, potentially impacting downstream analyses.

  • If the dimensionality of sdata.tables[table_name] is smaller than the desired number of principal components, n_comps is set to the minimum dimensionality, and a message is printed.
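Both warnings can be made concrete with a small sketch. The helpers are hypothetical and simplified: clip_scaled only caps values above max_value_scale (how scale() treats the lower side is not covered here), and effective_n_comps follows the wording of the second warning rather than the library source.

```python
def clip_scaled(values, max_value_scale=10):
    """Cap already-standardized values at max_value_scale (upper side only)."""
    if max_value_scale is None:
        return list(values)  # no clipping requested
    return [min(value, max_value_scale) for value in values]


def effective_n_comps(n_obs, n_vars, n_comps=50):
    """Reduce n_comps to the smallest table dimension when it is too large."""
    min_dim = min(n_obs, n_vars)
    if min_dim < n_comps:
        print(f"n_comps reduced from {n_comps} to {min_dim}")
        return min_dim
    return n_comps


scaled = [-1.5, 0.2, 12.0]
capped = clip_scaled(scaled, max_value_scale=10)
flattened = clip_scaled(scaled, max_value_scale=0.1)
small_panel = effective_n_comps(n_obs=1000, n_vars=30)
```

With max_value_scale=0.1 the two largest values both collapse to 0.1, which shows how a cap that is too low removes the variation between them. For a 30-gene panel, the default of 50 components is reduced to 30.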

See also

harpy.tb.allocate

Create an AnnData table in sdata using a points_name and a labels_name.