harpy.tb.preprocess_proteomics

harpy.tb.preprocess_proteomics(sdata, labels_name, table_name, output_table_name, calculate_cell_size=True, size_norm=True, library_norm=False, log1p=True, scale=False, max_value_scale=10, q=None, max_value_q=1, calculate_pca=False, n_comps=50, instance_size_key='shapeSize', raw_counts_key='raw_counts', overwrite=False)

Preprocess a table (AnnData) attribute of a SpatialData object for proteomics data.

Performs optional normalization (by instance size or via normalize_total()), log transformation (log1p()), scaling (scale()) or quantile normalization, and PCA calculation (pca()) on the proteomics data contained in sdata.
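As a rough illustration of the two steps that are enabled by default (size_norm and log1p), the sketch below divides each cell's intensities by the cell size, multiplies by 100 as stated for size_norm, and then applies log1p. It uses plain Python lists in place of an AnnData table, and the helper name size_normalize_and_log is hypothetical; this is not harpy's implementation.

```python
import math


def size_normalize_and_log(intensities, sizes, factor=100.0):
    """Divide each cell's intensities by its size, rescale, then apply log1p.

    Hypothetical helper for illustration only; rows are cells, columns are channels.
    """
    result = []
    for row, size in zip(intensities, sizes):
        if size <= 0:
            raise ValueError("instance sizes must be positive")
        result.append([math.log1p(value / size * factor) for value in row])
    return result


# Two cells (rows) and two channels (columns) of total intensities.
intensities = [[200.0, 50.0], [400.0, 0.0]]
# Cell sizes, playing the role of the values stored in .obs[instance_size_key].
sizes = [100.0, 200.0]

normalized = size_normalize_and_log(intensities, sizes)
```

In this example both cells end up with the same value in the first channel, because the second cell has twice the total intensity but also twice the size.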

Parameters:
  • sdata (SpatialData) – The input SpatialData object.

  • labels_name (str | Iterable[str]) – The labels element(s) of sdata used to select the cells via the region key in sdata.tables[table_name].obs. Note that if output_table_name is equal to table_name and overwrite is True, cells in sdata.tables[table_name] that are linked to other labels elements (via the region key) will be removed from sdata.tables[table_name]. If a list of labels elements is provided, they are preprocessed together (e.g. multiple samples).

  • table_name (str) – The table element in sdata to apply preprocessing to. It is an AnnData object containing total intensities per cell in .obs (rows) and per channel in .var (columns).

  • output_table_name (str) – The output table element in sdata to which the preprocessed table will be written.

  • calculate_cell_size (bool (default: True)) – If True, calculates cell sizes from labels_name and stores them in .obs[instance_size_key]. Set this to False when cell sizes are not needed or are already present and should be preserved.

  • size_norm (bool (default: True)) – If True, normalization is based on the size of the nucleus/cell. Resulting values are multiplied by 100 after normalization.

  • library_norm (bool (default: False)) – If True, normalize_total() is used for normalization.

  • log1p (bool (default: True)) – If True, applies log1p transformation to the data (via log1p()), after optional normalization.

  • scale (bool (default: False)) – If True, scales the data to have zero mean and a variance of one (via scale()). Scaled values are capped at max_value_scale.

  • max_value_scale (float | None (default: 10)) – The maximum value allowed after scaling; larger values are capped at this value. Ignored if scale is False.

  • q (float | None (default: None)) – Quantile used for normalization. If specified, values are normalized by this quantile calculated for each adata.var. Typical value used is 0.999. Resulting values are multiplied by 100 after normalization.

  • max_value_q (float | None (default: 1)) – The maximum value to which data will be scaled when performing quantile normalization. Ignored if q is None. Typical value is 1. Resulting values are multiplied by 100 after normalization.

  • calculate_pca (bool (default: False)) – If True, calculates principal component analysis (PCA) on the data.

  • n_comps (int (default: 50)) – Number of principal components to calculate. Ignored if calculate_pca is False.

  • instance_size_key (str (default: 'shapeSize')) – The key in the AnnData table .obs that holds the size of the instances. When size_norm is True, this column must either already exist in .obs or calculate_cell_size must be True.

  • raw_counts_key (str (default: 'raw_counts')) – Name of the AnnData layer where the non-preprocessed intensities will be stored. This parameter is ignored if no preprocessing is applied (i.e. size_norm, library_norm, log1p, scale are all False and q is None).

  • overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.
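A minimal usage sketch based on the signature above. The element names "segmentation_mask", "table_intensities" and "table_preprocessed" are hypothetical placeholders, and the wrapper function preprocess_example exists only for illustration; sdata is assumed to be a SpatialData object that already contains the labels element and the intensity table.

```python
def preprocess_example(sdata):
    """Size-normalize and log-transform one table, then compute a PCA."""
    # Imported lazily so the sketch can be defined without harpy installed.
    import harpy

    return harpy.tb.preprocess_proteomics(
        sdata,
        labels_name="segmentation_mask",          # hypothetical labels element
        table_name="table_intensities",           # hypothetical input table
        output_table_name="table_preprocessed",   # hypothetical output table
        size_norm=True,       # divide by the size stored in .obs["shapeSize"]
        log1p=True,           # log-transform after normalization
        calculate_pca=True,
        n_comps=20,           # reduced if the table is smaller than this
        overwrite=True,       # replace "table_preprocessed" if it exists
    )
```

Writing to a separate output_table_name, as in this sketch, leaves the input table untouched, which avoids the removal of cells linked to other labels elements described under labels_name.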

Return type:

SpatialData

Returns:

The sdata containing the preprocessed AnnData object as an attribute (sdata.tables[output_table_name]).

Raises:

ValueError

  • If sdata does not contain any labels elements.

  • If sdata does not contain any table elements.

  • If labels_name, or one of the elements of labels_name, is not a labels element in sdata.

  • If table_name is not a table element in sdata.

  • If scale is True and q is not None.
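The conditions listed above can be checked before calling the function. The sketch below mirrors them on plain collections of element names; check_arguments and its parameters available_labels and available_tables are hypothetical names for illustration, and the error messages are not harpy's.

```python
def check_arguments(available_labels, available_tables, labels_name,
                    table_name, scale=False, q=None):
    """Raise ValueError for the documented invalid argument combinations.

    Returns labels_name as a list so callers can treat str and iterables alike.
    """
    if not available_labels:
        raise ValueError("sdata does not contain any labels elements.")
    if not available_tables:
        raise ValueError("sdata does not contain any table elements.")

    # labels_name may be a single name or an iterable of names.
    names = [labels_name] if isinstance(labels_name, str) else list(labels_name)
    for name in names:
        if name not in available_labels:
            raise ValueError(f"'{name}' is not a labels element in sdata.")

    if table_name not in available_tables:
        raise ValueError(f"'{table_name}' is not a table element in sdata.")

    # Scaling and quantile normalization are mutually exclusive.
    if scale and q is not None:
        raise ValueError("scale=True cannot be combined with q.")
    return names


names = check_arguments({"mask_a", "mask_b"}, {"table"}, ["mask_a", "mask_b"], "table")
```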

Warning

  • If scale is True and max_value_scale is set too low, it may overly constrain the variability of the data, potentially impacting downstream analyses.

  • If the dimensionality of sdata.tables[table_name] is smaller than the desired number of principal components when calculate_pca is True, n_comps is set to the minimum dimensionality, and a message is printed.
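The adjustment of n_comps can be pictured as follows. This sketch assumes that "minimum dimensionality" means the smaller of the number of cells and the number of channels in the table; the helper effective_n_comps is hypothetical and the printed message is not harpy's.

```python
def effective_n_comps(n_obs, n_vars, n_comps=50):
    """Return the number of components usable for a table of shape (n_obs, n_vars)."""
    # Assumption: the limit is the smaller of the two table dimensions.
    limit = min(n_obs, n_vars)
    if limit < n_comps:
        print(f"n_comps reduced from {n_comps} to {limit}")
        return limit
    return n_comps


# A panel with 12 channels cannot yield the default 50 components.
small_panel = effective_n_comps(n_obs=1000, n_vars=12)
# A table with 200 channels keeps the requested value.
large_panel = effective_n_comps(n_obs=1000, n_vars=200)
```

Proteomics panels often contain only a few dozen channels, so with the default n_comps of 50 this reduction is likely to apply.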

See also

harpy.tb.allocate_intensity

Create an AnnData table in sdata using an image_name and a labels_name.