harpy.utils.Featurizer#

class harpy.utils.Featurizer(mask_dask_array, image_dask_array=None, run_on_gpu=True)#

Helper class to featurize images and labels using Dask.

Parameters:
  • mask_dask_array (Array) – A 3D Dask array of integer labels representing segmented regions. Expected shape is (‘z’, ‘y’, ‘x’). Each unique integer value represents a separate label.

  • image_dask_array (Array | None (default: None)) – A 4D Dask array representing the image data with shape (‘c’, ‘z’, ‘y’, ‘x’), where ‘c’ is the number of channels. Can be None if only mask-based computations (e.g., count or center of mass) are required.

Raises:
  • ValueError – If mask_dask_array does not contain an integer dtype.

  • AssertionError – If mask_dask_array is not 3D.

  • AssertionError – If image_dask_array is is not 4D.

  • AssertionError – If spatial dimensions of image_dask_array and mask_dask_array do not match.

  • AssertionError – If chunk sizes of spatial dimensions do not match between image and mask.

Notes

  • Unique labels are computed once at initialization for efficiency.

  • Image and mask must be aligned in spatial dimensions and chunking to ensure accurate and efficient featurization.

Methods table#

calculate_instance_statistics(depth, ...[, ...])

Extract per-instance statistics using a user-provided callable fn.

extract_instances(diameter, depth[, ...])

Extract per-label instance windows from the mask and image of size diameter in y and x using dask.array.map_overlap() and dask.array.map_blocks().

featurize(embedding_dimension, diameter[, ...])

Extract per-instance feature vectors from the image/mask using a user-provided embedding model.

quantiles(diameter[, q, zarr_output_path, ...])

Compute per-instance intensity quantiles.

radii_and_principal_axes(diameter[, ...])

Compute per-instance radii and principal axes.

Methods#

Featurizer.calculate_instance_statistics(depth, statistic_dimension, diameter=None, zarr_output_path=None, store_intermediate=False, fn=<function _dummy_statistic_image>, batch_size=None, extract_image=True, fn_kwargs=mappingproxy({}), overwrite=False, **kwargs)#

Extract per-instance statistics using a user-provided callable fn.

This method constructs a Dask graph that processes each non-zero label in the mask (provided via harpy.utils.Featurizer(mask_dask_array=...)). For each labeled instance, a centered (y, x) window is extracted, with size determined by diameter or 2 * depth. If extract_image is True and an image is provided via harpy.utils.Featurizer(image_dask_array=...), a corresponding window is also extracted from the image.

When extract_image=True, pixels outside the labeled object are removed from the image window, and only the remaining pixel values are passed to the callable fn. In this case, fn receives an array of shape (c, N), where c is the number of channels and N is the number of pixels belonging to the non-zero mask for the given instance. The callable must return an array of shape (c, statistic_dimension).

Chunking along the channel dimension c is supported. If the image is chunked along this dimension, fn will be invoked separately for each channel chunk.

When extract_image=False, pixels outside the labeled object are set to zero in the mask window, and the resulting array is passed to fn. In this mode, fn receives an array of shape (z, diameter, diameter) and must return an array of shape (statistic_dimension,).

Note:

Decreasing the chunk size of the provided image and mask arrays will reduce RAM usage. A good first guess for image/mask chunking is (c_chunksize, y_chunksize, x_chunksize) = (10, 2048, 2048).

Parameters:
  • depth (int) – Passed to dask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger.

  • statistic_dimension (int) – The dimensionality s of the statistics returned by fn. The returned Dask array will have shape (i, c, s).

  • diameter (int | None (default: None)) – Optional explicit side length of the resulting y, x window for every instance. If not provided diameter is set to 2 times depth.

  • zarr_output_path (str | Path | None (default: None)) – If a filesystem path (string or Path) is provided, the feature Dask array is computed and materialized to a Zarr store at that location. The returned object will still be a Dask array backed by the written data, but all computations necessary to populate the store will have been executed. If None (default), no data are written and the method returns a lazy (not yet computed) Dask array.

  • store_intermediate (bool (default: False)) – If True, intermediate .zarr data will be written to disk to reduce peak RAM usage. This is useful for large datasets. If zarr_output_path is not specified, it is not allowed to set store_intermediate=True.

  • fn (Callable[..., ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]] (default: <function _dummy_statistic_image at 0x742f8caa67a0>)) –

    Function applied to each extracted instance window.

    If extract_image=True, fn is called with an array of shape (c, N), where c is the number of channels and N is the number of pixels corresponding to the non-zero mask for the given instance. The function must return an array of shape (c, statistic_dimension).

    If extract_image=False, fn receives an array of shape (z, diameter, diameter), where values outside the labeled object are set to zero. In this case, the function must return an array of shape (statistic_dimension,).

    The callable may optionally accept the keyword argument run_on_gpu. If present and run_on_gpu is set to True when initializing Featurizer, the function may use GPU-specific implementations (e.g. via CuPy). If run_on_gpu is not accepted by fn, it is called without this argument and the computation backend is determined solely by the type of the input array.

  • batch_size (int | None (default: None)) –

    Number of instances processed together along the instance dimension i when evaluating the statistic computation (through fn).

    Smaller values reduce peak memory usage during function evaluation, but may increase overhead due to more frequent rechunking and task scheduling in the Dask graph.

  • extract_image (bool) – If True, extract instance windows from the image. See the description of fn for details on the expected input and output shapes.

  • fn_kwargs (dict, optional) – Additional keyword arguments passed to fn.

  • overwrite (bool (default: False)) – If set to True will overwrite the zarr store at zarr_output_path. Ignored if zarr_output_path is None.

  • **kwargs (Any) – Additional keyword arguments forwarded to dask.array.map_blocks(). Use with care.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]], Array]

Returns:

: tuple:

  • a Numpy array containing indices of extracted labels, shape (i,). Dimension of i will be equal to the total number of non-zero labels in the mask.

  • A Dask array (feature matrix) of statistics with shape (i, c, statistic_dimension). If zarr_output_path is provided, this array points to the computed Zarr store; otherwise it is lazy.

Examples

Example 1

import harpy as hp
import numpy as np
from numpy.typing import NDArray

# Load example dataset
sdata = hp.datasets.pixie_example()

# Prepare image and mask arrays
image_array = sdata["raw_image_fov0"].data[:, None, ...]
mask_array = sdata["label_whole_fov0"].data[None, ...]

def _dummy_statistic_image(array: NDArray, value: int):
    np.random.seed(42)
    # shape of array=(c, number of pixels corresponding to non zero mask for instance i)
    assert array.ndim == 2
    C = array.shape[0]
    _statistic_dimension=3
    # return dummy statistic of shape (C, statistic_dimension)
    return np.random.rand(C, _statistic_dimension) + value

# Create featurizer
featurizer = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=image_array,
)

value = 100
fn_kwargs = {"value": value}

# Compute instance statistics
instance_ids, calculated_statistic_lazy = featurizer.calculate_instance_statistics(
    diameter=50,
    depth=100,
    statistic_dimension=3,
    fn=_dummy_statistic_image,
    fn_kwargs=fn_kwargs,
    extract_image=True,
    batch_size=500,
)

# Execute the computation
result = calculated_statistic_lazy.compute()

Example 2

import harpy as hp
import numpy as np
from numpy.typing import NDArray

sdata=hp.datasets.pixie_example()

image_array=sdata[ "raw_image_fov0" ].data[ :, None, ... ]
mask_array=sdata[ "label_whole_fov0" ].data[ None, ... ]

def _dummy_statistic_mask( array: NDArray, value: int,)->NDArray:
    # array should be of dtype int
    assert np.issubdtype(array.dtype, np.integer)
    # array is of shape = z,y,x, with y and x the size of the instance window.
    assert array.ndim == 3
    statistic_dimension=5
    result = np.random.rand( statistic_dimension)+value
    # return array containing float of shape (statistic_dimension,)
    return result[ None, ... ]

featurizer=hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=None )

value=100
fn_kwargs = { "value": value }

instance_ids, calculated_statistic_lazy = featurizer.calculate_instance_statistics(
    diameter=100,
    depth = 50,
    statistic_dimension=5,
    fn=_dummy_statistic_mask,
    fn_kwargs=fn_kwargs,
    extract_image=True,
    batch_size=500,
)
result = calculated_statistic_lazy.compute()
Featurizer.extract_instances(diameter, depth, remove_background=True, extract_mask=False, extract_image=True, zarr_output_path=None, name_instances_image='image.zarr', name_instances_mask='mask.zarr', batch_size=None, overwrite=False)#

Extract per-label instance windows from the mask and image of size diameter in y and x using dask.array.map_overlap() and dask.array.map_blocks().

See harpy.tb.extract_instances() for a full description.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]], Array | tuple[Array, Array]]

Returns:

: A 2-tuple (instance_ids, instances) where:

  • instance_ids is a one-dimensional NumPy array of shape (i,) containing the labels of the extracted instances. The value i is the total number of non-zero labels in the input mask. The order of instance_ids is not guaranteed to be sorted.

  • instances contains the extracted instance windows.

    • If exactly one of extract_mask or extract_image is True, this is a single Dask array with shape (i, c, z, y, x) if extract_image is True and of shape (i, 1, z, y, x) if extract_mask is True.

    • If both extract_mask and extract_image are True, this is a 2-tuple (mask_instances, image_instances) where:

      • mask_instances has shape (i, 1, z, y, x) and contains the extracted instance masks.

      • image_instances has shape (i, c, z, y, x) and contains the extracted instance image windows.

    Here, c is the number of image channels, z is the number of planes in the z-dimension, and y and x are equal to diameter.

    The returned Dask arrays are lazy unless zarr_output_path is specified, in which case the data are written to disk and reloaded as Dask arrays backed by Zarr.

Examples

Basic usage:

import harpy as hp

sdata = hp.datasets.pixie_example()

image_name = "raw_image_fov0"
labels_name = "label_whole_fov0"

mask_array = (
    sdata[labels_name]
    .data[None, ...]
    .rechunk(1024)
)

image_array = (
    sdata[image_name]
    .data[:, None, ...]
    .rechunk(1024)
)

fe = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=image_array,
)

# Lazy Dask graph
instance_ids, instances = fe.extract_instances(
    depth=100,
    diameter=75,
)

# Inspect shape and chunking
instances

# Persist to Zarr on disk (computes instances)
instance_ids, instances = fe.extract_instances(
    depth=100,
    diameter=75,
    zarr_output_path="instances.zarr",
)

# Keep full window content instead of masking to the instance
instance_ids, instances = fe.extract_instances(
    depth=100,
    diameter=75,
    remove_background=False,
)

Visual sanity check of extracted instances:

import dask
import dask.array as da
import matplotlib.pyplot as plt
import harpy as hp

sdata = hp.datasets.pixie_example()

labels_name = "label_whole_fov0"
mask_array = sdata[labels_name].data[None, ...]

fe = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=None,
)

instance_ids, instances = fe.extract_instances(
    depth=50,
    diameter=75,
    batch_size=500,
    extract_mask=True,
    extract_image=True,
)

instances = instances.compute()

instance_id = 23
mask = instances[instance_ids == instance_id][0][0][0]
plt.imshow(mask)
plt.show()

mask_array_remove = da.where(mask_array == instance_id, mask_array, 0)

_, y_, x_ = da.where(mask_array == instance_id)
y_, x_ = dask.compute(y_, x_)

plt.imshow(
    mask_array_remove[
        0,
        y_.min():y_.max(),
        x_.min():x_.max(),
    ]
)
plt.show()

See also

harpy.tb.extract_instances

Extract instance windows from a labels element and (optionally) an image element.

Featurizer.featurize(embedding_dimension, diameter, depth=None, remove_background=True, zarr_output_path=None, store_intermediate=False, model=<function _dummy_embedding>, batch_size=None, model_kwargs=mappingproxy({}), dtype=<class 'numpy.float32'>, **kwargs)#

Extract per-instance feature vectors from the image/mask using a user-provided embedding model.

See harpy.tb.featurize() for a full description.

Parameters:
  • zarr_output_path (str | Path | None (default: None)) – If a filesystem path (string or Path) is provided, the feature Dask array is computed and materialized to a Zarr store at that location. The returned object will still be a Dask array backed by the written data, but all computations necessary to populate the store will have been executed. If None (default), no data are written and the method returns a lazy (not yet computed) Dask array.

  • store_intermediate (bool (default: False)) – If True, intermediate .zarr data will be written to disk to reduce peak RAM usage. This is useful for large datasets. If zarr_output_path is not specified, it is not allowed to set store_intermediate=True. It is preferred to set store_intermediate=False, and work with a Dask client, so Dask can spill to disk.

  • **kwargs (Any) – Additional keyword arguments forwarded to dask.array.map_blocks(). Use with care.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]], Array]

Returns:

: tuple:

  • a Numpy array containing indices of extracted labels, shape (i,). Dimension of i will be equal to the total number of non-zero labels in the mask.

  • A Dask array (feature matrix) of features with shape (i, embedding_dimension). If zarr_output_path is provided, this array points to the computed Zarr store; otherwise it is lazy.

Examples

import harpy as hp
import numpy as np

sdata = hp.datasets.pixie_example()

image_name = "raw_image_fov0"
labels_name = "label_whole_fov0"

mask_array = (
    sdata[labels_name]
    .data[None, ...]
    .rechunk(1024)
)

image_array = (
    sdata[image_name]
    .data[:, None, ...]
    .rechunk(1024)
)

fe = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=image_array,
)

# Use a custom model with arguments
def my_model(batch, normalize: bool = True, embedding_dimension:int=64) -> np.ndarray:
    # batch: (b, c, z, y, x) -> return (b, d)
    vecs = batch.reshape(batch.shape[0], -1).astype(np.float32)
    if normalize:
        norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8
        vecs = vecs / norms

    # Project to desired dimension (toy example)
    W = np.random.RandomState(0).randn(
        vecs.shape[1],
        embedding_dimension,
    ).astype(np.float32)
    return vecs @ W

# Lazy graph: generate embeddings with default dummy model
instance_ids, feats_lazy = fe.featurize(
    depth=96,
    embedding_dimension=64,
    diameter=192,
    model=my_model,
    model_kwargs={"normalize": True},
    batch_size=100,
    zarr_output_path=None,
)

# Inspect shape and chunking without computing
feats_lazy

# persist to Zarr on disk
# (this computes immediately)
instance_ids, feats = fe.featurize(
    depth=96,
    embedding_dimension=64,
    diameter=192,
    model=my_model,
    model_kwargs={"normalize": True},
    batch_size=100,
    zarr_output_path="features.zarr",
)

See also

harpy.tb.featurize

featurize instances in labels element from an image element using an embedding model.

Featurizer.quantiles(diameter, q=array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]), zarr_output_path=None, store_intermediate=False, depth=None, batch_size=None, instance_key='cell_ID', overwrite=False)#

Compute per-instance intensity quantiles.

For each labeled instance, a centered window of size diameter in the y and x dimensions is extracted, and the requested quantiles of the instance pixel intensities are computed.

Parameters:
  • diameter (int) – Estimated maximum diameter of an instance in the y and x dimensions. Used to determine the spatial extent of the extracted window.

  • q (float | list[float] | ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]] (default: array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]))) – Quantile or quantiles to compute. May be a single float in the interval [0, 1] or a sequence of floats. By default, nine evenly spaced quantiles between 0.1 and 0.9 are computed.

  • zarr_output_path (str | Path | None (default: None)) – Filesystem path (string or Path) where the feature Dask array is written as a Zarr store. If provided, the Dask graph is computed and the data are materialized to disk at this location. The returned object is still a Dask array backed by the written data. If None (default), no data are written and the method returns a lazy (not yet computed) Dask array.

  • store_intermediate (bool (default: False)) – If True, intermediate .zarr data will be written to disk to reduce peak RAM usage. This is useful for large datasets. If zarr_output_path is not specified, it is not allowed to set store_intermediate=True.

  • depth (int | None (default: None)) – Passed to dask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger.

  • batch_size (int | None (default: None)) – Number of instances processed together during computation. Smaller values reduce peak memory usage at the cost of increased overhead.

  • instance_key (str (default: 'cell_ID')) – Name of the column in the output DataFrames that contains the identifier of each instance, matching the corresponding label value in the mask.

  • overwrite (bool (default: False)) – If set to True will overwrite the zarr store at zarr_output_path. Ignored if zarr_output_path is None.

Return type:

list[DataFrame] | tuple[ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]], Array]

Returns:

: If zarr_output_path is None, returns a list of pandas DataFrames, one per requested quantile. Each DataFrame contains per-instance quantile values and is indexed by the instance identifier.

If zarr_output_path is specified, returns a tuple (instance_ids, quantiles), where instance_ids is a NumPy array of instance identifiers and quantiles is a Dask array backed by the written Zarr store. In this case, all computations required to populate the Zarr store are executed.

Notes

The computation is performed lazily and may be executed in parallel using dask. Memory usage can be controlled via batch_size.

Featurizer.radii_and_principal_axes(diameter, calculate_axes=True, depth=None, batch_size=None, instance_key='cell_ID')#

Compute per-instance radii and principal axes.

For each labeled instance, a centered window of size diameter in the y and x dimensions is extracted from the mask. The spatial extent of each instance is used to compute its radii (sorted from largest to smallest), and optionally its principal axes.

Parameters:
  • diameter (int) – Estimated maximum diameter of an instance in the y and x dimensions. Used to determine the spatial extent of the extracted window.

  • calculate_axes (bool (default: True)) – If True, compute and include the principal axes for each instance. When enabled, the axes are returned as a flattened 3 × 3 matrix (nine columns) per instance.

  • depth (int | None (default: None)) – Passed to dask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger.

  • batch_size (int | None (default: None)) – Number of instances processed together during computation. Smaller values reduce peak memory usage at the cost of increased overhead.

  • instance_key (str (default: 'cell_ID')) – Name of the column in the output DataFrame that contains the identifier of each instance, matching the corresponding label value in the mask.

Return type:

DataFrame

Returns:

: A DataFrame where each row corresponds to a single instance and rows are ordered by instance identifier. The DataFrame contains:

  • One column with the instance (label) identifier.

  • Three columns containing the radii, sorted from largest to smallest.

  • If calculate_axes is True, nine additional columns containing the flattened 3 × 3 principal axes matrix.

Notes

The computation is performed lazily and may be executed in parallel using dask. Memory usage can be controlled via batch_size.