harpy.tb.featurize
harpy.tb.featurize(sdata, image_name, labels_name, table_name, output_table_name, diameter, embedding_dimension, depth=None, remove_background=True, model=<function _dummy_embedding>, batch_size=None, model_kwargs=mappingproxy({}), embedding_obsm_key='embedding', to_coordinate_system='global', instance_key='cell_ID', region_key='fov_labels', cell_index_name='cells', dtype=<class 'numpy.float32'>, run_on_gpu=False, store_intermediate=False, overwrite=False, **kwargs)
Extract per-instance feature vectors from image_name and labels_name using a user-provided embedding model.

This method constructs a Dask graph that, for each non-zero label in labels_name, extracts a centered (y, x) window (size set by diameter or 2 * depth) from image_name, optionally removes background pixels outside the labeled object (also in the corresponding image_name), and feeds the resulting instance cutout (with preserved z and channel dimensions) through model to produce an embedding of size embedding_dimension.

Internally, instance windows are generated lazily (via dask.array.map_overlap() and dask.array.map_blocks()) and then batched along the instance dimension to evaluate model in parallel. The output is a Dask array of shape (i, d), where i is the number of non-zero labels and d == embedding_dimension.

The resulting feature vectors are computed and added to sdata[output_table_name].obsm[embedding_obsm_key] as a NumPy array. If table_name is None, an empty table element is created at output_table_name. Otherwise, the feature vectors are sorted and filtered according to sdata[table_name].obs[_INSTANCE_KEY], and similarly added to sdata[output_table_name].obsm[embedding_obsm_key].

For optimal performance, configure dask to use processes, e.g.: dask.config.set(scheduler="processes").

- Note:
Decreasing the chunk size of the provided image and mask arrays will reduce RAM usage. A good first guess for image/mask chunking is (c_chunksize, y_chunksize, x_chunksize) = (10, 2048, 2048).
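The per-instance window extraction described above can be pictured with a small NumPy-only sketch. This is an illustration of the idea, not harpy's implementation: the function name extract_window is hypothetical, the window is centered on the bounding box of the label (an assumption, since the text does not say how the center is defined), and border handling by zero padding is likewise assumed.

```python
import numpy as np


def extract_window(image, labels, label_id, diameter, remove_background=True):
    """Cut a centered (y, x) window around one labeled instance.

    image:  array of shape (c, z, y, x)
    labels: integer array of shape (z, y, x); 0 is background
    Returns an array of shape (c, z, diameter, diameter).
    """
    mask = labels == label_id
    if not mask.any():
        raise ValueError(f"label {label_id} not found")

    # Assumption: center the window on the bounding box of the instance
    # in y and x, pooling all z planes.
    _, ys, xs = np.nonzero(mask)
    cy = (ys.min() + ys.max() + 1) // 2
    cx = (xs.min() + xs.max() + 1) // 2

    # Zero-pad y and x so that windows near the border keep their full size.
    half = diameter // 2
    pad = diameter
    image_p = np.pad(image, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    mask_p = np.pad(mask, ((0, 0), (pad, pad), (pad, pad)))

    y0 = cy + pad - half
    x0 = cx + pad - half
    window = image_p[:, :, y0:y0 + diameter, x0:x0 + diameter].copy()

    if remove_background:
        # Zero every pixel that does not belong to this instance,
        # in all channels at once.
        inside = mask_p[:, y0:y0 + diameter, x0:x0 + diameter]
        window[:, ~inside] = 0
    return window


# Toy data: 2 channels, 1 z plane, 8 x 8 pixels, two labeled instances.
image = np.ones((2, 1, 8, 8), dtype=np.float32)
labels = np.zeros((1, 8, 8), dtype=np.int32)
labels[0, 2:4, 2:4] = 1
labels[0, 5:7, 5:7] = 2

window = extract_window(image, labels, label_id=1, diameter=4)
print(window.shape)  # (2, 1, 4, 4)
```

With remove_background=False the same call returns the full window content, including pixels of neighboring instances.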
- Parameters:
sdata (SpatialData) – SpatialData object.
image_name (str | list[str]) – Name of the image element.
labels_name (str | list[str]) – Name of the labels element.
table_name (str | None) – Name of the table element. If table_name is None, an empty table will be created with the calculated embeddings at .obsm[embedding_obsm_key], and annotated by labels_name.
output_table_name (str) – Name of the output table element. Can be set equal to table_name if overwrite is set to True.
diameter (int) – Side length of the resulting y, x window for every instance.
embedding_dimension (int) – The dimensionality d of the feature vectors returned by model. The returned Dask array will have shape (i, embedding_dimension).
depth (int | None (default: None)) – Passed to dask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger. If not provided, depth is set to diameter // 2 + 1.
remove_background (bool (default: True)) – If True (default), pixels outside the instance label within each window are set to background (e.g., zero) so that only the object remains inside the cutout. If False, the full window content is passed to model.
model (Callable[..., ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound=generic)]]] (default: <function _dummy_embedding>)) – A callable that maps a batch of instance windows to embeddings: (batch_size, c, z, y, x) -> (batch_size, embedding_dimension), e.g. model(batch, **model_kwargs) -> np.ndarray. The callable should accept NumPy arrays; Dask will handle chunking and batching. The callable must include the parameter 'embedding_dimension'.
batch_size (int | None (default: None)) – Chunk size of the resulting Dask array in the instance dimension i during model evaluation. Lower values can reduce (GPU) memory usage during model evaluation, but at the cost of more overhead (rechunking).
model_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Extra keyword arguments forwarded to model at call time (e.g., device selection, inference flags).
embedding_obsm_key (default: 'embedding') – Name of the feature matrix added to sdata[output_table_name].obsm.
to_coordinate_system (str | list[str] (default: 'global')) – The coordinate system that holds image_name and labels_name.
instance_key (str (default: 'cell_ID')) – Instance key. The name of the column in adata.obs that will hold the instance ids. Ignored if table_name is not None.
region_key (str (default: 'fov_labels')) – Region key. The name of the column in adata.obs that holds the name of the elements (region) that are annotated by the table element. Ignored if table_name is not None.
cell_index_name (str (default: 'cells')) – The name of the index of the resulting AnnData table. Ignored if table_name is not None.
dtype (dtype (default: <class 'numpy.float32'>)) – Output dtype of model.
run_on_gpu (bool (default: False)) – If True and 'cupy' is installed, the instance extraction step runs on the GPU.
store_intermediate (bool (default: False)) – If True, intermediate and temporary feature stores are written to disk during featurization. This is only supported for backed SpatialData; if sdata.is_backed() is False, a ValueError is raised.
overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.
**kwargs (Any) – Additional keyword arguments forwarded to dask.array.map_blocks(). Use with care.
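The contract for model (a batch of shape (batch_size, c, z, y, x) in, an array of shape (batch_size, embedding_dimension) out) can be tried out without harpy or Dask. The following is a minimal NumPy sketch under stated assumptions: mean_intensity_model and embed_in_batches are hypothetical names introduced for illustration, and the explicit loop only stands in for the chunked evaluation that the text says Dask handles; it is not how harpy schedules the work.

```python
import numpy as np


def mean_intensity_model(batch, embedding_dimension=8, dtype=np.float32):
    """Hypothetical model: per-channel mean intensity, zero-padded to a fixed size.

    batch: (b, c, z, y, x) -> returns (b, embedding_dimension)
    """
    b, c = batch.shape[:2]
    if c > embedding_dimension:
        raise ValueError("embedding_dimension must be >= number of channels")
    # Average every channel over z, y and x.
    means = batch.reshape(b, c, -1).mean(axis=2)
    out = np.zeros((b, embedding_dimension), dtype=dtype)
    out[:, :c] = means
    return out


def embed_in_batches(windows, model, batch_size, **model_kwargs):
    """Evaluate model on consecutive slices of at most batch_size instances."""
    parts = [
        model(windows[start:start + batch_size], **model_kwargs)
        for start in range(0, len(windows), batch_size)
    ]
    return np.concatenate(parts, axis=0)


# Ten instance windows with 3 channels, 1 z plane and 16 x 16 pixels.
windows = np.random.default_rng(0).random((10, 3, 1, 16, 16))
embeddings = embed_in_batches(
    windows, mean_intensity_model, batch_size=4, embedding_dimension=8
)
print(embeddings.shape)  # (10, 8)
```

A smaller batch_size lowers the peak memory needed per model call in this sketch, at the price of more calls, which mirrors the trade-off described for the batch_size parameter.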
- Return type:
- Returns:
SpatialData object.
Examples
Allocate intensity statistics and compute embeddings using a custom model:
    import numpy as np

    import harpy as hp

    sdata = hp.datasets.pixie_example()

    image_name = "raw_image_fov0"
    labels_name = "label_whole_fov0"

    # First, create an AnnData table by allocating intensity statistics.
    # Note that this step is optional.
    sdata = hp.tb.allocate_intensity(
        sdata,
        image_name=image_name,
        labels_name=labels_name,
        to_coordinate_system="fov0",
        output_table_name="my_table",
        mode="sum",
        obs_stats="count",  # cell size
        overwrite=True,
    )


    # Define a custom embedding model
    def my_model(
        batch,
        normalize: bool = True,
        embedding_dimension: int = 64,
    ) -> np.ndarray:
        # batch: (b, c, z, y, x) -> return (b, d)
        vecs = batch.reshape(batch.shape[0], -1).astype(np.float32)
        if normalize:
            norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8
            vecs = vecs / norms
        # Project to desired embedding dimension (toy example)
        W = np.random.RandomState(0).randn(
            vecs.shape[1],
            embedding_dimension,
        ).astype(np.float32)
        return vecs @ W


    # Add embeddings to the table
    sdata = hp.tb.featurize(
        sdata,
        image_name=image_name,
        labels_name=labels_name,
        table_name="my_table",
        output_table_name="my_table",
        depth=96,
        embedding_dimension=64,
        diameter=192,
        model=my_model,
        model_kwargs={"normalize": True},
        batch_size=100,
        to_coordinate_system="fov0",
        embedding_obsm_key="embedding",
        overwrite=True,  # output_table_name equals table_name, so the table is replaced
    )

    # Access the computed embedding for each instance
    sdata["my_table"].obsm["embedding"]
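When table_name is given, the description above says the feature vectors are sorted and filtered according to the instance ids in the table. The sketch below shows one way such an alignment can be done in plain NumPy. It is only an illustration: align_to_table is a hypothetical helper, and raising an error for table instances without a feature vector is an assumption, since the text does not say how harpy treats that case.

```python
import numpy as np


def align_to_table(label_ids, features, table_instance_ids):
    """Reorder and filter feature rows so that row k matches table_instance_ids[k].

    label_ids:          (i,) instance id of every feature row
    features:           (i, d) feature matrix
    table_instance_ids: (n,) instance ids in table order
    """
    label_ids = np.asarray(label_ids)
    features = np.asarray(features)
    table_instance_ids = np.asarray(table_instance_ids)

    # Assumption: every instance in the table must have a feature vector.
    missing = ~np.isin(table_instance_ids, label_ids)
    if missing.any():
        raise KeyError(f"no features for instances {table_instance_ids[missing]}")

    # Sort the label ids once, then look up each table id by binary search.
    order = np.argsort(label_ids)
    pos = np.searchsorted(label_ids[order], table_instance_ids)
    return features[order[pos]]


label_ids = np.array([3, 1, 2, 5])
features = np.arange(8, dtype=np.float32).reshape(4, 2)
aligned = align_to_table(label_ids, features, table_instance_ids=[5, 1])
print(aligned)  # rows for instance 5 and instance 1, in that order
```

Instances that have a feature vector but do not appear in the table (here 2 and 3) are simply dropped, which corresponds to the filtering step.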