harpy.utils.Featurizer

harpy.utils.Featurizer#

class harpy.utils.Featurizer(mask_dask_array, image_dask_array=None, run_on_gpu=True)#

Helper class to featurize images and labels using Dask.

Parameters:

mask_dask_array (Array) – A 3D Dask array of integer labels representing segmented regions. Expected shape is (‘z’, ‘y’, ‘x’). Each unique integer value represents a separate label.
image_dask_array (Array | None (default: None)) – A 4D Dask array representing the image data with shape (‘c’, ‘z’, ‘y’, ‘x’), where ‘c’ is the number of channels. Can be None if only mask-based computations (e.g., count or center of mass) are required.

Raises:

ValueError – If mask_dask_array does not contain an integer dtype.
AssertionError – If mask_dask_array is not 3D.
AssertionError – If image_dask_array is is not 4D.
AssertionError – If spatial dimensions of image_dask_array and mask_dask_array do not match.
AssertionError – If chunk sizes of spatial dimensions do not match between image and mask.

Notes

Unique labels are computed once at initialization for efficiency.
Image and mask must be aligned in spatial dimensions and chunking to ensure accurate and efficient featurization.

Methods table#

`calculate_instance_statistics`(depth, ...[, ...])	Extract per-instance statistics using a user-provided callable `fn`.
`extract_instances`(diameter, depth[, ...])	Extract per-label instance windows from the mask and image of size `diameter` in `y` and `x` using `dask.array.map_overlap()` and `dask.array.map_blocks()`.
`featurize`(embedding_dimension, diameter[, ...])	Extract per-instance feature vectors from the image/mask using a user-provided embedding `model`.
`quantiles`(diameter[, q, zarr_output_path, ...])	Compute per-instance intensity quantiles.
`radii_and_principal_axes`(diameter[, ...])	Compute per-instance radii and principal axes.

Methods#

Featurizer.calculate_instance_statistics(depth, statistic_dimension, diameter=None, zarr_output_path=None, store_intermediate=False, fn=<function _dummy_statistic_image>, batch_size=None, extract_image=True, fn_kwargs=mappingproxy({}), overwrite=False, **kwargs)#

Extract per-instance statistics using a user-provided callable fn.

This method constructs a Dask graph that processes each non-zero label in the mask (provided via harpy.utils.Featurizer(mask_dask_array=...)). For each labeled instance, a centered (y, x) window is extracted, with size determined by diameter or 2 * depth. If extract_image is True and an image is provided via harpy.utils.Featurizer(image_dask_array=...), a corresponding window is also extracted from the image.

When extract_image=True, pixels outside the labeled object are removed from the image window, and only the remaining pixel values are passed to the callable fn. In this case, fn receives an array of shape (c, N), where c is the number of channels and N is the number of pixels belonging to the non-zero mask for the given instance. The callable must return an array of shape (c, statistic_dimension).

Chunking along the channel dimension c is supported. If the image is chunked along this dimension, fn will be invoked separately for each channel chunk.

When extract_image=False, pixels outside the labeled object are set to zero in the mask window, and the resulting array is passed to fn. In this mode, fn receives an array of shape (z, diameter, diameter) and must return an array of shape (statistic_dimension,).

Note:: Decreasing the chunk size of the provided image and mask arrays will reduce RAM usage. A good first guess for image/mask chunking is (c_chunksize, y_chunksize, x_chunksize) = (10, 2048, 2048).

Parameters:

depth (int) – Passed to dask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger.
statistic_dimension (int) – The dimensionality s of the statistics returned by fn. The returned Dask array will have shape (i, c, s).
diameter (int | None (default: None)) – Optional explicit side length of the resulting y, x window for every instance. If not provided diameter is set to 2 times depth.
zarr_output_path (str | Path | None (default: None)) – If a filesystem path (string or Path) is provided, the feature Dask array is computed and materialized to a Zarr store at that location. The returned object will still be a Dask array backed by the written data, but all computations necessary to populate the store will have been executed. If None (default), no data are written and the method returns a lazy (not yet computed) Dask array.
store_intermediate (bool (default: False)) – If True, intermediate .zarr data will be written to disk to reduce peak RAM usage. This is useful for large datasets. If zarr_output_path is not specified, it is not allowed to set store_intermediate=True.
fn (Callable[..., ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]]] (default: <function _dummy_statistic_image at 0x7bef30e57f60>)) –
Function applied to each extracted instance window.

If extract_image=True, fn is called with an array of shape (c, N), where c is the number of channels and N is the number of pixels corresponding to the non-zero mask for the given instance. The function must return an array of shape (c, statistic_dimension).

If extract_image=False, fn receives an array of shape (z, diameter, diameter), where values outside the labeled object are set to zero. In this case, the function must return an array of shape (statistic_dimension,).

The callable may optionally accept the keyword argument run_on_gpu. If present and run_on_gpu is set to True when initializing Featurizer, the function may use GPU-specific implementations (e.g. via CuPy). If run_on_gpu is not accepted by fn, it is called without this argument and the computation backend is determined solely by the type of the input array.
batch_size (int | None (default: None)) –
Number of instances processed together along the instance dimension i when evaluating the statistic computation (through fn).

Smaller values reduce peak memory usage during function evaluation, but may increase overhead due to more frequent rechunking and task scheduling in the Dask graph.
extract_image (bool (default: True)) – If True, extract instance windows from the image. See the description of fn for details on the expected input and output shapes.
fn_kwargs (Mapping[str, Any] (default: mappingproxy({}))) – Additional keyword arguments passed to fn.
overwrite (bool (default: False)) – If set to True will overwrite the zarr store at zarr_output_path. Ignored if zarr_output_path is None.
**kwargs (Any) – Additional keyword arguments forwarded to dask.array.map_blocks(). Use with care.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]], Array]

Returns:

: tuple:

a Numpy array containing indices of extracted labels, shape (i,). Dimension of i will be equal to the total number of non-zero labels in the mask.

A Dask array (feature matrix) of statistics with shape (i, c, statistic_dimension). If zarr_output_path is provided, this array points to the computed Zarr store; otherwise it is lazy.

Examples

Example 1

import harpy as hp
import numpy as np
from numpy.typing import NDArray

# Load example dataset
sdata = hp.datasets.pixie_example()

# Prepare image and mask arrays
image_array = sdata["raw_image_fov0"].data[:, None, ...]
mask_array = sdata["label_whole_fov0"].data[None, ...]

def _dummy_statistic_image(array: NDArray, value: int):
    np.random.seed(42)
    # shape of array=(c, number of pixels corresponding to non zero mask for instance i)
    assert array.ndim == 2
    C = array.shape[0]
    _statistic_dimension=3
    # return dummy statistic of shape (C, statistic_dimension)
    return np.random.rand(C, _statistic_dimension) + value

# Create featurizer
featurizer = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=image_array,
)

value = 100
fn_kwargs = {"value": value}

# Compute instance statistics
instance_ids, calculated_statistic_lazy = featurizer.calculate_instance_statistics(
    diameter=50,
    depth=100,
    statistic_dimension=3,
    fn=_dummy_statistic_image,
    fn_kwargs=fn_kwargs,
    extract_image=True,
    batch_size=500,
)

# Execute the computation
result = calculated_statistic_lazy.compute()

Example 2

import harpy as hp
import numpy as np
from numpy.typing import NDArray

sdata=hp.datasets.pixie_example()

image_array=sdata[ "raw_image_fov0" ].data[ :, None, ... ]
mask_array=sdata[ "label_whole_fov0" ].data[ None, ... ]

def _dummy_statistic_mask( array: NDArray, value: int,)->NDArray:
    # array should be of dtype int
    assert np.issubdtype(array.dtype, np.integer)
    # array is of shape = z,y,x, with y and x the size of the instance window.
    assert array.ndim == 3
    statistic_dimension=5
    result = np.random.rand( statistic_dimension)+value
    # return array containing float of shape (statistic_dimension,)
    return result[ None, ... ]

featurizer=hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=None )

value=100
fn_kwargs = { "value": value }

instance_ids, calculated_statistic_lazy = featurizer.calculate_instance_statistics(
    diameter=100,
    depth = 50,
    statistic_dimension=5,
    fn=_dummy_statistic_mask,
    fn_kwargs=fn_kwargs,
    extract_image=True,
    batch_size=500,
)
result = calculated_statistic_lazy.compute()

Featurizer.extract_instances(diameter, depth, remove_background=True, extract_mask=False, extract_image=True, zarr_output_path=None, name_instances_image='image.zarr', name_instances_mask='mask.zarr', batch_size=None, overwrite=False)#

Extract per-label instance windows from the mask and image of size diameter in y and x using dask.array.map_overlap() and dask.array.map_blocks().

See harpy.tb.extract_instances() for a full description.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[TypeVar(_ScalarT, bound= generic)]], Array | tuple[Array, Array]]

Returns:

: A 2-tuple (instance_ids, instances) where:

instance_ids is a one-dimensional NumPy array of shape (i,) containing the labels of the extracted instances. The value i is the total number of non-zero labels in the input mask. The order of instance_ids is not guaranteed to be sorted.
instances contains the extracted instance windows.
- If exactly one of extract_mask or extract_image is True, this is a single Dask array with shape (i, c, z, y, x) if extract_image is True and of shape (i, 1, z, y, x) if extract_mask is True.
- If both extract_mask and extract_image are True, this is a 2-tuple (mask_instances, image_instances) where:
  - mask_instances has shape (i, 1, z, y, x) and contains the extracted instance masks.
  - image_instances has shape (i, c, z, y, x) and contains the extracted instance image windows.
Here, c is the number of image channels, z is the number of planes in the z-dimension, and y and x are equal to diameter.

The returned Dask arrays are lazy unless zarr_output_path is specified, in which case the data are written to disk and reloaded as Dask arrays backed by Zarr.

Examples

Basic usage:

import harpy as hp

sdata = hp.datasets.pixie_example()

image_name = "raw_image_fov0"
labels_name = "label_whole_fov0"

mask_array = (
    sdata[labels_name]
    .data[None, ...]
    .rechunk(1024)
)

image_array = (
    sdata[image_name]
    .data[:, None, ...]
    .rechunk(1024)
)

fe = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=image_array,
)

# Lazy Dask graph
instance_ids, instances = fe.extract_instances(
    depth=100,
    diameter=75,
)

# Inspect shape and chunking
instances

# Persist to Zarr on disk (computes instances)
instance_ids, instances = fe.extract_instances(
    depth=100,
    diameter=75,
    zarr_output_path="instances.zarr",
)

# Keep full window content instead of masking to the instance
instance_ids, instances = fe.extract_instances(
    depth=100,
    diameter=75,
    remove_background=False,
)

Visual sanity check of extracted instances:

import dask
import dask.array as da
import matplotlib.pyplot as plt
import harpy as hp

sdata = hp.datasets.pixie_example()

labels_name = "label_whole_fov0"
mask_array = sdata[labels_name].data[None, ...]

fe = hp.utils.Featurizer(
    mask_dask_array=mask_array,
    image_dask_array=None,
)

instance_ids, instances = fe.extract_instances(
    depth=50,
    diameter=75,
    batch_size=500,
    extract_mask=True,
    extract_image=True,
)

instances = instances.compute()

instance_id = 23
mask = instances[instance_ids == instance_id][0][0][0]
plt.imshow(mask)
plt.show()

mask_array_remove = da.where(mask_array == instance_id, mask_array, 0)

_, y_, x_ = da.where(mask_array == instance_id)
y_, x_ = dask.compute(y_, x_)

plt.imshow(
    mask_array_remove[
        0,
        y_.min():y_.max(),
        x_.min():x_.max(),
    ]
)
plt.show()

harpy.utils.Featurizer

Contents

harpy.utils.Featurizer#

Methods table#

Methods#