harpy.tb.preprocess_transcriptomics
- harpy.tb.preprocess_transcriptomics(sdata, labels_name, table_name, output_table_name, percent_top=(2, 5), min_counts=10, min_cells=5, size_norm=True, library_norm=False, highly_variable_genes=False, highly_variable_genes_kwargs=mappingproxy({}), max_value_scale=10, n_comps=50, update_shapes_elements=True, instance_size_key='shapeSize', raw_counts_key='raw_counts', overwrite=False)
Preprocess a table (AnnData) attribute of a SpatialData object for transcriptomics data.
Performs filtering (via filter_cells() and filter_genes()) and optional normalization (on size or via normalize_total()), log transformation (log1p()), highly variable genes selection (highly_variable_genes()), scaling (scale()), and PCA calculation (pca()) for transcriptomics data contained in sdata.tables[table_name]. QC metrics are added to sdata.tables[output_table_name].obs using calculate_qc_metrics().
- Parameters:
  sdata (SpatialData) – The input SpatialData object.
  labels_name (str | Iterable[str]) – The labels element(s) of sdata used to select the cells via the region key in sdata.tables[table_name].obs. Note that if output_table_name is equal to table_name and overwrite is True, cells in sdata.tables[table_name] linked to other labels_name (via the region key) will be removed from sdata.tables[table_name] (also from the backing zarr store if it is backed).
  table_name (str) – The table element in sdata on which to perform preprocessing.
  output_table_name (str) – The output table element in sdata to which the preprocessed table will be written.
  percent_top (tuple[int, ...] (default: (2, 5))) – Ranks (where genes are ranked by expression) at which the cumulative proportion of expression will be reported as a percentage. Passed to calculate_qc_metrics().
  min_counts (int (default: 10)) – Minimum number of counts a cell should contain to be kept (passed to filter_cells()).
  min_cells (int (default: 5)) – Minimum number of cells a gene should be in to be kept (passed to filter_genes()).
  size_norm (bool (default: True)) – If True, normalization is based on the size of the nucleus/cell. Resulting values are multiplied by 100 after normalization.
  library_norm (bool (default: False)) – If True, normalize_total() is used for normalization.
  highly_variable_genes (bool (default: False)) – If True, will only retain highly variable genes, as calculated by highly_variable_genes().
  highly_variable_genes_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Keyword arguments passed to highly_variable_genes(). Ignored if highly_variable_genes is False.
  max_value_scale (float | None (default: 10)) – The maximum value to which data will be scaled, when scaling the data to have zero mean and a variance of one, using scale().
  n_comps (int (default: 50)) – Number of principal components to calculate.
  update_shapes_elements (bool (default: True)) – Whether to filter the shapes elements associated with labels_name. If set to True, cells that do not appear in the resulting output_table_name (with the region key equal to labels_name) will be removed from the shapes elements (via region key) in the sdata object. Filtered shapes will be added to sdata with prefix ‘filtered_low_counts’. This parameter is deprecated and will be removed in a future version.
  instance_size_key (str (default: 'shapeSize')) – The key in the AnnData table .obs that holds the size of the instances. If instance_size_key is not found in .obs, it will be calculated from the labels_name. Ignored if size_norm is False.
  raw_counts_key (str (default: 'raw_counts')) – Name of the AnnData layer where the non-preprocessed counts will be stored.
  overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.
- Return type:
  SpatialData
- Returns:
  The sdata containing the preprocessed AnnData object as an attribute (sdata.tables[output_table_name]).
- Raises:
  ValueError – If sdata does not have a labels attribute.
  ValueError – If sdata does not have a tables attribute.
  ValueError – If labels_name, or one of the elements of labels_name, is not a labels element in sdata.
  ValueError – If table_name is not a table element in sdata.
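
As a rough illustration of what min_counts, min_cells and size_norm control, the NumPy sketch below applies the same kind of thresholds to a toy count matrix, in the order given in the description above (cells, then genes, then size normalization). It is not harpy's or scanpy's implementation: the helper name filter_and_size_normalize, the toy matrix and the sizes array are made up for illustration, and the exact comparison rules (>= thresholds, a gene counting as detected when its count is above zero) are assumptions.

```python
import numpy as np


def filter_and_size_normalize(counts, sizes, min_counts=10, min_cells=5):
    """Toy version of count filtering followed by size normalization (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    sizes = np.asarray(sizes, dtype=float)

    # Keep cells whose total count reaches min_counts.
    keep_cells = counts.sum(axis=1) >= min_counts
    counts, sizes = counts[keep_cells], sizes[keep_cells]

    # Keep genes detected (count > 0) in at least min_cells of the remaining cells.
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]

    # Divide each cell's counts by its size, then multiply by 100.
    normalized = counts / sizes[:, None] * 100
    return normalized, keep_cells, keep_genes


# Three cells (rows) and three genes (columns); sizes are per-cell areas.
counts = np.array([[6, 0, 4], [1, 0, 0], [5, 0, 5]])
sizes = np.array([200.0, 50.0, 100.0])
normalized, keep_cells, keep_genes = filter_and_size_normalize(
    counts, sizes, min_counts=10, min_cells=2
)
```

In this toy input the second cell falls below min_counts and the second gene is never detected, so both are dropped before the remaining counts are divided by cell size.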
Warning
  If max_value_scale is set too low, it may overly constrain the variability of the data, potentially impacting downstream analyses.
  If the dimensionality of sdata.tables[table_name] is smaller than the desired number of principal components, n_comps is set to the minimum dimensionality, and a message is printed.
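
Both warnings can be made concrete with a small sketch. The two helpers below, clamp_n_comps and scale_and_clip, are hypothetical names written for this illustration and are not part of harpy or scanpy. The scaling uses the population standard deviation and caps only the upper end, which is an assumption; the estimator and clipping rule used by scale() may differ.

```python
import numpy as np


def clamp_n_comps(n_obs, n_vars, n_comps=50):
    """Reduce n_comps to the smallest table dimension when it is too large."""
    min_dim = min(n_obs, n_vars)
    if min_dim < n_comps:
        print(f"n_comps reduced from {n_comps} to {min_dim}")
        return min_dim
    return n_comps


def scale_and_clip(x, max_value=10):
    """Scale each column to zero mean and unit variance, then cap values at max_value."""
    x = np.asarray(x, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    scaled = (x - x.mean(axis=0)) / std
    if max_value is not None:
        scaled = np.minimum(scaled, max_value)
    return scaled


# One gene measured in five cells, with a single strongly expressing cell.
gene = [[0], [0], [0], [0], [100]]
unclipped = scale_and_clip(gene, max_value=None)
clipped = scale_and_clip(gene, max_value=1)
```

With a cap of 1, the one strongly expressing cell is pulled down from its scaled value of 2, which is the kind of lost variability the first warning describes.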
See also
  harpy.tb.allocate
    Create an AnnData table in sdata using a points_name and a labels_name.
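
A call might look like the sketch below, which uses only parameters from the signature above. The element names ("segmentation_mask", "table_transcriptomics", "table_transcriptomics_preprocessed"), the wrapper function preprocess_example and the import alias hp are assumptions for illustration. The sdata argument is expected to be an existing SpatialData object that already contains the labels and table elements.

```python
def preprocess_example(sdata):
    """Call preprocess_transcriptomics with explicit arguments on an existing SpatialData object."""
    # Imported lazily so that defining this function does not require harpy.
    import harpy as hp  # assumed import alias

    return hp.tb.preprocess_transcriptomics(
        sdata,
        labels_name="segmentation_mask",  # hypothetical labels element
        table_name="table_transcriptomics",  # hypothetical input table
        output_table_name="table_transcriptomics_preprocessed",  # hypothetical output table
        min_counts=10,
        min_cells=5,
        size_norm=True,
        n_comps=50,
        overwrite=True,
    )
```

Writing to a separate output_table_name leaves the input table untouched; the preprocessed table would then be available as sdata.tables["table_transcriptomics_preprocessed"].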