harpy.utils.Featurizer#
- class harpy.utils.Featurizer(mask_dask_array, image_dask_array=None, run_on_gpu=True)#
Helper class to featurize images and labels using Dask.
- Parameters:
mask_dask_array (
Array) – A 3D Dask array of integer labels representing segmented regions. Expected shape is (‘z’, ‘y’, ‘x’). Each unique integer value represents a separate label.image_dask_array (
Array|None(default:None)) – A 4D Dask array representing the image data with shape (‘c’, ‘z’, ‘y’, ‘x’), where ‘c’ is the number of channels. Can beNoneif only mask-based computations (e.g., count or center of mass) are required.
- Raises:
ValueError – If
mask_dask_arraydoes not contain an integer dtype.AssertionError – If
mask_dask_arrayis not 3D.AssertionError – If
image_dask_arrayis is not 4D.AssertionError – If spatial dimensions of
image_dask_arrayandmask_dask_arraydo not match.AssertionError – If chunk sizes of spatial dimensions do not match between image and mask.
Notes
Unique labels are computed once at initialization for efficiency.
Image and mask must be aligned in spatial dimensions and chunking to ensure accurate and efficient featurization.
Methods table#
|
Extract per-instance statistics using a user-provided callable |
|
Extract per-label instance windows from the mask and image of size |
|
Extract per-instance feature vectors from the image/mask using a user-provided embedding |
|
Compute per-instance intensity quantiles. |
|
Compute per-instance radii and principal axes. |
Methods#
- Featurizer.calculate_instance_statistics(depth, statistic_dimension, diameter=None, zarr_output_path=None, store_intermediate=False, fn=<function _dummy_statistic_image>, batch_size=None, extract_image=True, fn_kwargs=mappingproxy({}), overwrite=False, **kwargs)#
Extract per-instance statistics using a user-provided callable
fn.This method constructs a Dask graph that processes each non-zero label in the mask (provided via
harpy.utils.Featurizer(mask_dask_array=...)). For each labeled instance, a centered(y, x)window is extracted, with size determined bydiameteror2 * depth. Ifextract_imageis True and an image is provided viaharpy.utils.Featurizer(image_dask_array=...), a corresponding window is also extracted from the image.When
extract_image=True, pixels outside the labeled object are removed from the image window, and only the remaining pixel values are passed to the callablefn. In this case,fnreceives an array of shape(c, N), wherecis the number of channels andNis the number of pixels belonging to the non-zero mask for the given instance. The callable must return an array of shape(c, statistic_dimension).Chunking along the channel dimension
cis supported. If the image is chunked along this dimension,fnwill be invoked separately for each channel chunk.When
extract_image=False, pixels outside the labeled object are set to zero in the mask window, and the resulting array is passed tofn. In this mode,fnreceives an array of shape(z, diameter, diameter)and must return an array of shape(statistic_dimension,).- Note:
Decreasing the chunk size of the provided image and mask arrays will reduce RAM usage. A good first guess for image/mask chunking is
(c_chunksize, y_chunksize, x_chunksize) = (10, 2048, 2048).
- Parameters:
depth (
int) – Passed todask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger.statistic_dimension (
int) – The dimensionalitysof the statistics returned byfn. The returned Dask array will have shape(i, c, s).diameter (
int|None(default:None)) – Optional explicit side length of the resultingy,xwindow for every instance. If not provideddiameteris set to 2 timesdepth.zarr_output_path (
str|Path|None(default:None)) – If a filesystem path (string orPath) is provided, the feature Dask array is computed and materialized to a Zarr store at that location. The returned object will still be a Dask array backed by the written data, but all computations necessary to populate the store will have been executed. IfNone(default), no data are written and the method returns a lazy (not yet computed) Dask array.store_intermediate (
bool(default:False)) – IfTrue, intermediate.zarrdata will be written to disk to reduce peak RAM usage. This is useful for large datasets. Ifzarr_output_pathis not specified, it is not allowed to setstore_intermediate=True.fn (
Callable[...,ndarray[tuple[Any,...],dtype[TypeVar(_ScalarT, bound=generic)]]] (default:<function _dummy_statistic_image at 0x742f8caa67a0>)) –Function applied to each extracted instance window.
If
extract_image=True,fnis called with an array of shape(c, N), wherecis the number of channels andNis the number of pixels corresponding to the non-zero mask for the given instance. The function must return an array of shape(c, statistic_dimension).If
extract_image=False,fnreceives an array of shape(z, diameter, diameter), where values outside the labeled object are set to zero. In this case, the function must return an array of shape(statistic_dimension,).The callable may optionally accept the keyword argument
run_on_gpu. If present andrun_on_gpuis set toTruewhen initializingFeaturizer, the function may use GPU-specific implementations (e.g. via CuPy). Ifrun_on_gpuis not accepted byfn, it is called without this argument and the computation backend is determined solely by the type of the input array.batch_size (
int|None(default:None)) –Number of instances processed together along the instance dimension
iwhen evaluating the statistic computation (throughfn).Smaller values reduce peak memory usage during function evaluation, but may increase overhead due to more frequent rechunking and task scheduling in the Dask graph.
extract_image (bool) – If True, extract instance windows from the image. See the description of
fnfor details on the expected input and output shapes.fn_kwargs (dict, optional) – Additional keyword arguments passed to
fn.overwrite (
bool(default:False)) – If set toTruewill overwrite the zarr store atzarr_output_path. Ignored ifzarr_output_pathisNone.**kwargs (
Any) – Additional keyword arguments forwarded todask.array.map_blocks(). Use with care.
- Return type:
tuple[ndarray[tuple[Any,...],dtype[TypeVar(_ScalarT, bound=generic)]],Array]- Returns:
: tuple:
a Numpy array containing indices of extracted labels, shape
(i,). Dimension ofiwill be equal to the total number of non-zero labels in the mask.A Dask array (feature matrix) of statistics with shape
(i, c, statistic_dimension). Ifzarr_output_pathis provided, this array points to the computed Zarr store; otherwise it is lazy.
Examples
Example 1
import harpy as hp import numpy as np from numpy.typing import NDArray # Load example dataset sdata = hp.datasets.pixie_example() # Prepare image and mask arrays image_array = sdata["raw_image_fov0"].data[:, None, ...] mask_array = sdata["label_whole_fov0"].data[None, ...] def _dummy_statistic_image(array: NDArray, value: int): np.random.seed(42) # shape of array=(c, number of pixels corresponding to non zero mask for instance i) assert array.ndim == 2 C = array.shape[0] _statistic_dimension=3 # return dummy statistic of shape (C, statistic_dimension) return np.random.rand(C, _statistic_dimension) + value # Create featurizer featurizer = hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=image_array, ) value = 100 fn_kwargs = {"value": value} # Compute instance statistics instance_ids, calculated_statistic_lazy = featurizer.calculate_instance_statistics( diameter=50, depth=100, statistic_dimension=3, fn=_dummy_statistic_image, fn_kwargs=fn_kwargs, extract_image=True, batch_size=500, ) # Execute the computation result = calculated_statistic_lazy.compute()
Example 2
import harpy as hp import numpy as np from numpy.typing import NDArray sdata=hp.datasets.pixie_example() image_array=sdata[ "raw_image_fov0" ].data[ :, None, ... ] mask_array=sdata[ "label_whole_fov0" ].data[ None, ... ] def _dummy_statistic_mask( array: NDArray, value: int,)->NDArray: # array should be of dtype int assert np.issubdtype(array.dtype, np.integer) # array is of shape = z,y,x, with y and x the size of the instance window. assert array.ndim == 3 statistic_dimension=5 result = np.random.rand( statistic_dimension)+value # return array containing float of shape (statistic_dimension,) return result[ None, ... ] featurizer=hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=None ) value=100 fn_kwargs = { "value": value } instance_ids, calculated_statistic_lazy = featurizer.calculate_instance_statistics( diameter=100, depth = 50, statistic_dimension=5, fn=_dummy_statistic_mask, fn_kwargs=fn_kwargs, extract_image=True, batch_size=500, ) result = calculated_statistic_lazy.compute()
- Featurizer.extract_instances(diameter, depth, remove_background=True, extract_mask=False, extract_image=True, zarr_output_path=None, name_instances_image='image.zarr', name_instances_mask='mask.zarr', batch_size=None, overwrite=False)#
Extract per-label instance windows from the mask and image of size
diameterinyandxusingdask.array.map_overlap()anddask.array.map_blocks().See
harpy.tb.extract_instances()for a full description.- Return type:
tuple[ndarray[tuple[Any,...],dtype[TypeVar(_ScalarT, bound=generic)]],Array|tuple[Array,Array]]- Returns:
: A 2-tuple
(instance_ids, instances)where:instance_idsis a one-dimensional NumPy array of shape(i,)containing the labels of the extracted instances. The valueiis the total number of non-zero labels in the input mask. The order ofinstance_idsis not guaranteed to be sorted.instancescontains the extracted instance windows.If exactly one of
extract_maskorextract_imageisTrue, this is a single Dask array with shape(i, c, z, y, x)ifextract_imageisTrueand of shape(i, 1, z, y, x)ifextract_maskisTrue.If both
extract_maskandextract_imageareTrue, this is a 2-tuple(mask_instances, image_instances)where:mask_instanceshas shape(i, 1, z, y, x)and contains the extracted instance masks.image_instanceshas shape(i, c, z, y, x)and contains the extracted instance image windows.
Here,
cis the number of image channels,zis the number of planes in the z-dimension, andyandxare equal todiameter.The returned Dask arrays are lazy unless
zarr_output_pathis specified, in which case the data are written to disk and reloaded as Dask arrays backed by Zarr.
Examples
Basic usage:
import harpy as hp sdata = hp.datasets.pixie_example() image_name = "raw_image_fov0" labels_name = "label_whole_fov0" mask_array = ( sdata[labels_name] .data[None, ...] .rechunk(1024) ) image_array = ( sdata[image_name] .data[:, None, ...] .rechunk(1024) ) fe = hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=image_array, ) # Lazy Dask graph instance_ids, instances = fe.extract_instances( depth=100, diameter=75, ) # Inspect shape and chunking instances # Persist to Zarr on disk (computes instances) instance_ids, instances = fe.extract_instances( depth=100, diameter=75, zarr_output_path="instances.zarr", ) # Keep full window content instead of masking to the instance instance_ids, instances = fe.extract_instances( depth=100, diameter=75, remove_background=False, )
Visual sanity check of extracted instances:
import dask import dask.array as da import matplotlib.pyplot as plt import harpy as hp sdata = hp.datasets.pixie_example() labels_name = "label_whole_fov0" mask_array = sdata[labels_name].data[None, ...] fe = hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=None, ) instance_ids, instances = fe.extract_instances( depth=50, diameter=75, batch_size=500, extract_mask=True, extract_image=True, ) instances = instances.compute() instance_id = 23 mask = instances[instance_ids == instance_id][0][0][0] plt.imshow(mask) plt.show() mask_array_remove = da.where(mask_array == instance_id, mask_array, 0) _, y_, x_ = da.where(mask_array == instance_id) y_, x_ = dask.compute(y_, x_) plt.imshow( mask_array_remove[ 0, y_.min():y_.max(), x_.min():x_.max(), ] ) plt.show()
See also
harpy.tb.extract_instancesExtract instance windows from a labels element and (optionally) an image element.
- Featurizer.featurize(embedding_dimension, diameter, depth=None, remove_background=True, zarr_output_path=None, store_intermediate=False, model=<function _dummy_embedding>, batch_size=None, model_kwargs=mappingproxy({}), dtype=<class 'numpy.float32'>, **kwargs)#
Extract per-instance feature vectors from the image/mask using a user-provided embedding
model.See
harpy.tb.featurize()for a full description.- Parameters:
zarr_output_path (
str|Path|None(default:None)) – If a filesystem path (string orPath) is provided, the feature Dask array is computed and materialized to a Zarr store at that location. The returned object will still be a Dask array backed by the written data, but all computations necessary to populate the store will have been executed. IfNone(default), no data are written and the method returns a lazy (not yet computed) Dask array.store_intermediate (
bool(default:False)) – IfTrue, intermediate.zarrdata will be written to disk to reduce peak RAM usage. This is useful for large datasets. Ifzarr_output_pathis not specified, it is not allowed to setstore_intermediate=True. It is preferred to setstore_intermediate=False, and work with a Dask client, so Dask can spill to disk.**kwargs (
Any) – Additional keyword arguments forwarded todask.array.map_blocks(). Use with care.
- Return type:
tuple[ndarray[tuple[Any,...],dtype[TypeVar(_ScalarT, bound=generic)]],Array]- Returns:
: tuple:
a Numpy array containing indices of extracted labels, shape
(i,). Dimension ofiwill be equal to the total number of non-zero labels in the mask.A Dask array (feature matrix) of features with shape
(i, embedding_dimension). Ifzarr_output_pathis provided, this array points to the computed Zarr store; otherwise it is lazy.
Examples
import harpy as hp import numpy as np sdata = hp.datasets.pixie_example() image_name = "raw_image_fov0" labels_name = "label_whole_fov0" mask_array = ( sdata[labels_name] .data[None, ...] .rechunk(1024) ) image_array = ( sdata[image_name] .data[:, None, ...] .rechunk(1024) ) fe = hp.utils.Featurizer( mask_dask_array=mask_array, image_dask_array=image_array, ) # Use a custom model with arguments def my_model(batch, normalize: bool = True, embedding_dimension:int=64) -> np.ndarray: # batch: (b, c, z, y, x) -> return (b, d) vecs = batch.reshape(batch.shape[0], -1).astype(np.float32) if normalize: norms = np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8 vecs = vecs / norms # Project to desired dimension (toy example) W = np.random.RandomState(0).randn( vecs.shape[1], embedding_dimension, ).astype(np.float32) return vecs @ W # Lazy graph: generate embeddings with default dummy model instance_ids, feats_lazy = fe.featurize( depth=96, embedding_dimension=64, diameter=192, model=my_model, model_kwargs={"normalize": True}, batch_size=100, zarr_output_path=None, ) # Inspect shape and chunking without computing feats_lazy # persist to Zarr on disk # (this computes immediately) instance_ids, feats = fe.featurize( depth=96, embedding_dimension=64, diameter=192, model=my_model, model_kwargs={"normalize": True}, batch_size=100, zarr_output_path="features.zarr", )
See also
harpy.tb.featurizefeaturize instances in labels element from an image element using an embedding model.
- Featurizer.quantiles(diameter, q=array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]), zarr_output_path=None, store_intermediate=False, depth=None, batch_size=None, instance_key='cell_ID', overwrite=False)#
Compute per-instance intensity quantiles.
For each labeled instance, a centered window of size
diameterin theyandxdimensions is extracted, and the requested quantiles of the instance pixel intensities are computed.- Parameters:
diameter (
int) – Estimated maximum diameter of an instance in theyandxdimensions. Used to determine the spatial extent of the extracted window.q (
float|list[float] |ndarray[tuple[Any,...],dtype[TypeVar(_ScalarT, bound=generic)]] (default:array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]))) – Quantile or quantiles to compute. May be a single float in the interval[0, 1]or a sequence of floats. By default, nine evenly spaced quantiles between 0.1 and 0.9 are computed.zarr_output_path (
str|Path|None(default:None)) – Filesystem path (string orPath) where the feature Dask array is written as a Zarr store. If provided, the Dask graph is computed and the data are materialized to disk at this location. The returned object is still a Dask array backed by the written data. IfNone(default), no data are written and the method returns a lazy (not yet computed) Dask array.store_intermediate (
bool(default:False)) – IfTrue, intermediate.zarrdata will be written to disk to reduce peak RAM usage. This is useful for large datasets. Ifzarr_output_pathis not specified, it is not allowed to setstore_intermediate=True.depth (
int|None(default:None)) – Passed todask.array.map_overlap(). For correct results, choose depth to be roughly half of the estimated maximum diameter or larger.batch_size (
int|None(default:None)) – Number of instances processed together during computation. Smaller values reduce peak memory usage at the cost of increased overhead.instance_key (
str(default:'cell_ID')) – Name of the column in the output DataFrames that contains the identifier of each instance, matching the corresponding label value in the mask.overwrite (
bool(default:False)) – If set toTruewill overwrite the zarr store atzarr_output_path. Ignored ifzarr_output_pathisNone.
- Return type:
list[DataFrame] |tuple[ndarray[tuple[Any,...],dtype[TypeVar(_ScalarT, bound=generic)]],Array]- Returns:
: If
zarr_output_pathisNone, returns a list of pandas DataFrames, one per requested quantile. Each DataFrame contains per-instance quantile values and is indexed by the instance identifier.If
zarr_output_pathis specified, returns a tuple(instance_ids, quantiles), whereinstance_idsis a NumPy array of instance identifiers andquantilesis a Dask array backed by the written Zarr store. In this case, all computations required to populate the Zarr store are executed.
Notes
The computation is performed lazily and may be executed in parallel using
dask. Memory usage can be controlled viabatch_size.
- Featurizer.radii_and_principal_axes(diameter, calculate_axes=True, depth=None, batch_size=None, instance_key='cell_ID')#
Compute per-instance radii and principal axes.
For each labeled instance, a centered window of size
diameterin theyandxdimensions is extracted from the mask. The spatial extent of each instance is used to compute its radii (sorted from largest to smallest), and optionally its principal axes.- Parameters:
diameter (
int) – Estimated maximum diameter of an instance in theyandxdimensions. Used to determine the spatial extent of the extracted window.calculate_axes (
bool(default:True)) – If True, compute and include the principal axes for each instance. When enabled, the axes are returned as a flattened3 × 3matrix (nine columns) per instance.depth (
int|None(default:None)) – Passed todask.array.map_overlap(). For correct results, choosedepthto be roughly half of the estimated maximum diameter or larger.batch_size (
int|None(default:None)) – Number of instances processed together during computation. Smaller values reduce peak memory usage at the cost of increased overhead.instance_key (
str(default:'cell_ID')) – Name of the column in the output DataFrame that contains the identifier of each instance, matching the corresponding label value in the mask.
- Return type:
DataFrame- Returns:
: A DataFrame where each row corresponds to a single instance and rows are ordered by instance identifier. The DataFrame contains:
One column with the instance (label) identifier.
Three columns containing the radii, sorted from largest to smallest.
If
calculate_axesis True, nine additional columns containing the flattened3 × 3principal axes matrix.
Notes
The computation is performed lazily and may be executed in parallel using
dask. Memory usage can be controlled viabatch_size.