harpy.tb.featurize

Contents

harpy.tb.featurize#

harpy.tb.featurize(sdata, image_name, labels_name, table_name, output_table_name, diameter, embedding_dimension, depth=None, remove_background=True, model=<function _dummy_embedding>, batch_size=None, model_kwargs=mappingproxy({}), embedding_obsm_key='embedding', to_coordinate_system='global', instance_key='cell_ID', region_key='fov_labels', cell_index_name='cells', dtype=<class 'numpy.float32'>, run_on_gpu=False, store_intermediate=False, overwrite=False, **kwargs)#

Extract per-instance feature vectors from image_name and labels_name using a user-provided embedding model.

This method constructs a Dask graph that, for each non-zero label in labels_name, extracts a centered (y, x) window (size set by diameter or 2 * depth) from image_name, optionally removes background pixels outside the labeled object (also in the corresponding image_name), and feeds the resulting instance cutout (with preserved z and channel dimensions) through model to produce an embedding of size embedding_dimension.

Internally, instance windows are generated lazily (via dask.array.map_overlap() and dask.array.map_blocks()) and then batched along the instance dimension to evaluate model in parallel. The output is a Dask array of shape (i, d), where i is the number of non-zero labels and d == embedding_dimension.

The resulting feature vectors are computed and added to sdata[output_table_name].obsm[embedding_obsm_key] as a NumPy array. If table_name is None, an empty table element is created at output_table_name. Otherwise, the feature vectors are sorted and filtered according to sdata[table_name].obs[_INSTANCE_KEY], and similarly added to sdata[output_table_name].obsm[embedding_obsm_key].

For optimal performance, configure dask to use processes, e.g.: dask.config.set(scheduler="processes").

Note:

Decreasing the chunk size of the provided image and mask arrays will reduce RAM usage. A good first guess for image/mask chunking is (c_chunksize, y_chunksize, x_chunksize) = (10, 2048, 2048).

Parameters:
  • sdata (SpatialData) – SpatialData object.

  • image_name (str | list[str]) – Name of the image element.

  • labels_name (str | list[str]) – Name of the labels element.

  • table_name (str | None) – Name of the table element. If table_name is None, an empty table will be created with the calculated embeddings at .obsm[embedding_obsm_key], and annotated by labels_name.

  • output_table_name (str) – Name of the output table element. Can be set equal to table_name if overwrite is set to True.

  • diameter (int) – Side length of the resulting y, x window for every instance.

  • embedding_dimension (int) – The dimensionality d of the feature vectors returned by model. The returned Dask array will have shape (i, embedding_dimension).

  • depth (int | None (default: None)) – Passed to dask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger. If not provided depth is set to diameter//2 +1.

  • remove_background (bool (default: True)) – If True (default), pixels outside the instance label within each window are set to background (e.g., zero) so that only the object remains inside the cutout. If False, the full window content is passed to model.

  • model (Callable[..., ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]] (default: <function _dummy_embedding at 0x742f8caa65c0>)) – A callable that maps a batch of instance windows to embeddings: (batch_size, c,z,y,x)->(batch_size, embedding_dimension) , e.g. model(batch, **model_kwargs) -> np.ndarray. The callable should accept NumPy arrays; Dask will handle chunking and batching. The callable must include the parameter ‘embedding_dimension’

  • batch_size (int | None (default: None)) – Chunk size of the resulting Dask array in the instance dimension i during model evaluation. Lower values can reduce (GPU) memory usage during model evaluation, but at the cost of more overhead (rechunking).

  • model_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Extra keyword arguments forwarded to model at call time (e.g., device selection, inference flags).

  • embedding_obsm_key (default: 'embedding') – Name of the feature matrix added to sdata[output_table_name].obsm.

  • to_coordinate_system (str | list[str] (default: 'global')) – The coordinate system that holds image_name and labels_name.

  • instance_key (str (default: 'cell_ID')) – Instance key. The name of the column in adata.obs that will hold the instance ids. Ignored if table_name is not None.

  • region_key (str (default: 'fov_labels')) – Region key. The name of the column in adata.obs that holds the name of the elements (region) that are annotated by the table element. Ignored if table_name is not None.

  • cell_index_name (str (default: 'cells')) – The name of the index of the resulting AnnData table. Ignored if table_name is not None.

  • dtype (dtype (default: <class 'numpy.float32'>)) – Output dtype of model.

  • run_on_gpu (bool (default: False)) – If True and ‘cupy’ is installed, the instance extraction step runs on the GPU.

  • store_intermediate (bool (default: False)) – If True, intermediate and temporary feature stores are written to disk during featurization. This is only supported for backed SpatialData; if sdata.is_backed() is False, a ValueError is raised.

  • overwrite (bool (default: False)) – If True, overwrites the output_table_name if it already exists in sdata.

  • **kwargs (Any) – Additional keyword arguments forwarded to dask.array.map_blocks(). Use with care.

Return type:

SpatialData

Returns:

: Spatialdata object.

Examples

Allocate intensity statistics and compute embeddings using a custom model:

import numpy as np
import harpy as hp

sdata = hp.datasets.pixie_example()

image_name = "raw_image_fov0"
labels_name = "label_whole_fov0"

# First, create an AnnData table by allocating intensity statistics
# Note that this step is optional.
sdata = hp.tb.allocate_intensity(
    sdata,
    image_name=image_name,
    labels_name=labels_name,
    to_coordinate_system="fov0",
    output_table_name="my_table",
    mode="sum",
    obs_stats="count",  # cell size
    overwrite=True,
)

# Define a custom embedding model
def my_model(
    batch,
    normalize: bool = True,
    embedding_dimension: int = 64,
) -> np.ndarray:
    # batch: (b, c, z, y, x) -> return (b, d)
    vecs = batch.reshape(batch.shape[0], -1).astype(np.float32)

    if normalize:
        norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8
        vecs = vecs / norms

    # Project to desired embedding dimension (toy example)
    W = np.random.RandomState(0).randn(
        vecs.shape[1],
        embedding_dimension,
    ).astype(np.float32)

    return vecs @ W

# Add embeddings to the table
sdata = hp.tb.featurize(
    sdata,
    image_name=image_name,
    labels_name=labels_name,
    table_name="my_table",
    output_table_name="my_table",
    depth=96,
    embedding_dimension=64,
    diameter=192,
    model=my_model,
    model_kwargs={"normalize": True},
    batch_size=100,
    to_coordinate_system="fov0",
    embedding_obsm_key="embedding",
)

# Access the computed embedding for each instance
sdata["my_table"].obsm["embedding"]