harpy.tb.ZarrIterableInstances

class harpy.tb.ZarrIterableInstances(zarr_path, instance_ids, labels=None, chunk_i=None, shuffle_chunks=False, chunk_seed=0, shuffle_within_chunk=False, buffer_seed=None, normalize='minmax01', x_dtype=torch.float32, return_instance_id=True, return_row_index=True, drop_unlabeled=True, allowed_chunk_indexes=None)

Chunk-wise iterable dataset that:

  • iterates over the Zarr array chunk by chunk

  • optionally shuffles chunks deterministically per epoch

  • partitions chunks across DDP ranks and dataloader workers

  • in supervised training (labels provided), yields (x, y) with an optional instance id and row index

  • in inference (no labels), yields x with an optional instance id and row index

Inspired by https://gitlab.in2p3.fr/ipsl/espri/espri-ia/projects/zarr-torch-dataset
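
The chunk partitioning described above can be illustrated with a minimal sketch. It assumes a simple strided assignment over a flattened (rank, worker) index; the helper name chunks_for_consumer is hypothetical, and the class may partition chunks differently.

```python
def chunks_for_consumer(chunk_order, rank, world_size, worker_id, num_workers):
    """Return the chunks handled by one (rank, worker) pair.

    Assumption: chunks are dealt out round-robin, which is one common
    scheme and not necessarily the one used by ZarrIterableInstances.
    """
    # Flatten (rank, worker) into a single consumer index.
    consumer = rank * num_workers + worker_id
    total_consumers = world_size * num_workers
    # Strided slicing gives each consumer a disjoint subset of chunks.
    return chunk_order[consumer::total_consumers]


chunk_order = list(range(10))  # chunk indices, possibly already shuffled
world_size, num_workers = 2, 2

assignment = {
    (rank, worker): chunks_for_consumer(
        chunk_order, rank, world_size, worker, num_workers
    )
    for rank in range(world_size)
    for worker in range(num_workers)
}

for consumer, chunks in assignment.items():
    print(consumer, chunks)
```

With this scheme every chunk is read by exactly one consumer, but consumers can receive different numbers of chunks when the chunk count is not a multiple of world_size * num_workers.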

Parameters:
  • zarr_path (str) – Path to a Zarr array of shape (i, c, z, y, x) where i indexes instances.

  • instance_ids (ndarray) – One-dimensional array of length i with the instance identifier per row in the Zarr array.

  • labels (ndarray | None (default: None)) – Optional one-dimensional array of length i with supervised labels. If None, the dataset operates in inference mode.

  • chunk_i (int | None (default: None)) – Chunk size along the instance axis. If None, uses the Zarr chunk size for axis 0.

  • shuffle_chunks (bool (default: False)) – If True, shuffle the chunk order deterministically, seeded by chunk_seed and the current epoch.

  • chunk_seed (int (default: 0)) – Base seed used for shuffling chunks. Combined with epoch when shuffle_chunks is True.

  • shuffle_within_chunk (bool (default: False)) – If True, shuffle rows within each chunk.

  • buffer_seed (int | None (default: None)) – Base seed for within-chunk shuffling. If None, within-chunk shuffling is non-deterministic.

  • normalize (str | None (default: 'minmax01')) – Normalization mode for each instance tensor. Supported values are "minmax01", "unit", and either None or "none".

  • x_dtype (dtype (default: torch.float32)) – Output dtype for returned tensors.

  • return_instance_id (bool (default: True)) – If True, append the instance id to each output tuple.

  • return_row_index (bool (default: True)) – If True, append the global row index to each output tuple. This is the row's position along axis 0 of the Zarr array, and therefore also its index into instance_ids.

  • drop_unlabeled (bool (default: True)) – If True and labels is provided, drop rows with label < 0.

  • allowed_chunk_indexes (ndarray | None (default: None)) – Optional subset of chunk indices to iterate, e.g. for train/test splits.
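
As an illustration of the train/test use of allowed_chunk_indexes, the following sketch splits chunk indices into two disjoint sets. It assumes that chunk indices count consecutive blocks of chunk_i rows along axis 0; the helper split_chunk_indexes and the numbers used are hypothetical.

```python
import math
import random


def split_chunk_indexes(num_instances, chunk_size, test_fraction=0.2, seed=0):
    """Split chunk indices into disjoint train and test lists.

    Assumption: chunk k covers rows [k * chunk_size, (k + 1) * chunk_size).
    """
    num_chunks = math.ceil(num_instances / chunk_size)
    indexes = list(range(num_chunks))
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(indexes)
    num_test = max(1, round(num_chunks * test_fraction))
    test = sorted(indexes[:num_test])
    train = sorted(indexes[num_test:])
    return train, test


# Example numbers only: 1000 instances stored in chunks of 64 rows.
train_chunks, test_chunks = split_chunk_indexes(num_instances=1000, chunk_size=64)
print("train:", train_chunks)
print("test:", test_chunks)
```

Each list could then be converted with numpy.asarray and passed as allowed_chunk_indexes to a separate dataset instance, so that the training and test datasets never read the same chunk.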

See also

harpy.tb.extract_instances

Function to create the Zarr instances consumed by this dataset.

Methods table

set_epoch(epoch)

Call this once per epoch.

Methods

ZarrIterableInstances.set_epoch(epoch)

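Set the current epoch. Call this once per epoch, before iterating, so that the chunk order (seeded by chunk_seed combined with the epoch when shuffle_chunks is True) changes from one epoch to the next.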

Return type:

None
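
The calling pattern for set_epoch can be sketched with a toy stand-in class. ToyChunkDataset is hypothetical and mimics only the set_epoch contract; in particular, combining the seed and the epoch by addition is an assumption, not a description of how ZarrIterableInstances does it.

```python
import random


class ToyChunkDataset:
    """Hypothetical stand-in that yields chunk indices instead of tensors."""

    def __init__(self, num_chunks, chunk_seed=0):
        self.num_chunks = num_chunks
        self.chunk_seed = chunk_seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Only records the epoch; the shuffle happens when iteration starts.
        self.epoch = epoch

    def __iter__(self):
        order = list(range(self.num_chunks))
        # Assumption: seed and epoch are combined by simple addition.
        random.Random(self.chunk_seed + self.epoch).shuffle(order)
        return iter(order)


dataset = ToyChunkDataset(num_chunks=8, chunk_seed=0)
orders = []
for epoch in range(3):
    dataset.set_epoch(epoch)  # call once per epoch, before iterating
    orders.append(list(dataset))

for epoch, order in enumerate(orders):
    print(epoch, order)
```

Because the order depends only on the seed and the epoch, setting the same epoch again reproduces the same chunk order, which is what makes the shuffle deterministic per epoch.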