harpy.tb.ZarrIterableInstances

class harpy.tb.ZarrIterableInstances(zarr_path, instance_ids, labels=None, chunk_i=None, shuffle_chunks=False, chunk_seed=0, shuffle_within_chunk=False, buffer_seed=None, normalize='minmax01', x_dtype=torch.float32, return_instance_id=True, return_row_index=True, drop_unlabeled=True, allowed_chunk_indexes=None)

Chunk-wise iterable dataset that:

iterates over the Zarr array chunk by chunk
optionally shuffles the chunk order deterministically per epoch
partitions chunks across DDP ranks and dataloader workers
in supervised training (labels provided), yields (x, y) with optional instance_id/row_idx
in inference (no labels), yields x with optional instance_id/row_idx
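
To illustrate how chunks can be split across DDP ranks and dataloader workers, the minimal sketch below hands out chunk indices to each (rank, worker) pair in a strided, round-robin fashion. The helper name `partition_chunks` and the round-robin scheme are assumptions made for illustration; the class may assign chunks differently internally.

```python
def partition_chunks(chunk_indices, rank, world_size, worker_id, num_workers):
    """Return the chunk indices handled by one (rank, worker) pair.

    Hypothetical helper: each consumer receives a disjoint, strided slice.
    """
    consumer = rank * num_workers + worker_id   # flat index of this consumer
    n_consumers = world_size * num_workers      # total number of consumers
    return list(chunk_indices)[consumer::n_consumers]


# 10 chunks, 2 DDP ranks, 2 dataloader workers per rank.
chunks = list(range(10))
assignment = {
    (rank, worker): partition_chunks(chunks, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
}
```

Whatever the exact scheme, the property that matters is that every chunk is read by exactly one consumer, so no instance is yielded twice within an epoch.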

Inspired by https://gitlab.in2p3.fr/ipsl/espri/espri-ia/projects/zarr-torch-dataset

Parameters:

zarr_path (str) – Path to a Zarr array of shape (i, c, z, y, x) where i indexes instances.
instance_ids (ndarray) – One-dimensional array of length i giving the instance identifier of each row in the Zarr array.
labels (ndarray | None (default: None)) – Optional one-dimensional array of length i with supervised labels. If None, the dataset operates in inference mode.
chunk_i (int | None (default: None)) – Chunk size along the instance axis. If None, uses the Zarr chunk size for axis 0.
shuffle_chunks (bool (default: False)) – If True, shuffle the chunk order deterministically, seeded by chunk_seed and the current epoch.
chunk_seed (int (default: 0)) – Base seed used for shuffling chunks. Combined with epoch when shuffle_chunks is True.
shuffle_within_chunk (bool (default: False)) – If True, shuffle rows within each chunk.
buffer_seed (int | None (default: None)) – Base seed for within-chunk shuffling. If None, within-chunk shuffling is non-deterministic.
normalize (str | None (default: 'minmax01')) – Normalization mode applied to each instance tensor. Supported values are "minmax01", "unit", and None or "none".
x_dtype (dtype (default: torch.float32)) – Output dtype for returned tensors.
return_instance_id (bool (default: True)) – If True, append the instance id to each output tuple.
return_row_index (bool (default: True)) – If True, append the global row index (the position of the instance in instance_ids) to each output tuple.
drop_unlabeled (bool (default: True)) – If True and labels is provided, drop rows with label < 0.
allowed_chunk_indexes (ndarray | None (default: None)) – Optional subset of chunk indices to iterate, e.g. for train/test splits.
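
The interplay of chunk_i, allowed_chunk_indexes and drop_unlabeled can be sketched in plain Python. This is not the library's implementation: `iter_rows` is a hypothetical helper that only tracks row indices, assuming chunks are consecutive row ranges of length chunk_i along the instance axis and that a label below 0 marks an unlabeled row.

```python
def iter_rows(n_rows, chunk_i, labels=None, allowed_chunk_indexes=None,
              drop_unlabeled=True):
    """Yield (chunk_index, row_index) pairs in chunk order (illustrative only)."""
    n_chunks = (n_rows + chunk_i - 1) // chunk_i  # ceil division; last chunk may be short
    for c in range(n_chunks):
        # Skip chunks outside the allowed subset (e.g. the held-out split).
        if allowed_chunk_indexes is not None and c not in allowed_chunk_indexes:
            continue
        start, stop = c * chunk_i, min((c + 1) * chunk_i, n_rows)
        for row in range(start, stop):
            # In supervised mode, rows with a negative label are dropped.
            if labels is not None and drop_unlabeled and labels[row] < 0:
                continue
            yield c, row


labels = [0, 1, -1, 2, -1, 0, 1]
rows = list(iter_rows(len(labels), chunk_i=3, labels=labels,
                      allowed_chunk_indexes={0, 2}))
# rows == [(0, 0), (0, 1), (2, 6)]
```

Splitting on chunk indices rather than on individual rows keeps each split aligned with the on-disk chunking, so a train and a test dataset never need to read the same chunk.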

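Methods table

set_epoch(epoch)

Call this once at the start of each epoch. When shuffle_chunks is True, the epoch is combined with chunk_seed to determine the chunk order.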

Methods

ZarrIterableInstances.set_epoch(epoch)
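
The set_epoch pattern can be illustrated with a toy stand-in. The class `EpochShuffledChunks` below is hypothetical and is not the harpy class; in particular, combining the seed and the epoch by simple addition is an assumption. It only shows why the method is called once per epoch: the chunk order is reproducible for a given epoch and is re-drawn whenever the epoch changes.

```python
import random


class EpochShuffledChunks:
    """Toy stand-in (not the harpy class) showing the set_epoch pattern."""

    def __init__(self, n_chunks, chunk_seed=0):
        self.n_chunks = n_chunks
        self.chunk_seed = chunk_seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Remember the epoch; it is used the next time the dataset is iterated.
        self.epoch = int(epoch)

    def __iter__(self):
        order = list(range(self.n_chunks))
        # Assumption: seed and epoch are combined by simple addition.
        random.Random(self.chunk_seed + self.epoch).shuffle(order)
        return iter(order)


ds = EpochShuffledChunks(n_chunks=8, chunk_seed=0)
orders = []
for epoch in range(3):
    ds.set_epoch(epoch)      # once per epoch, before iterating
    orders.append(list(ds))  # chunk order used in this epoch
```

With the real dataset, the same call would sit at the top of the training loop, before the dataloader is iterated for that epoch.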