speechbrain.integrations.hdf5.cached_item module

A pipeline for caching data transformations in HDF5 files.

Authors:
  • Peter Plantinga, 2025

  • Adel Moumen, 2025

Summary

Classes:

CachedHDF5DynamicItem

CachedDynamicItem that uses HDF5 to store the cache.

Reference

class speechbrain.integrations.hdf5.cached_item.CachedHDF5DynamicItem(cache_location, file_mode='a', cache_filename='cache.hdf5', compression=None, *args, **kwargs)[source]

Bases: CachedDynamicItem

CachedDynamicItem that uses HDF5 to store the cache. HDF5 keeps the whole cache in a single file, which may be faster or more efficient than the default storage (one torch file per id).

Parameters:
  • cache_location (os.PathLike) – Folder in which the HDF5 cache file is stored.

  • file_mode (str) – The mode to use when opening the HDF5 file. When creating the cache, writing must be allowed, but when reading from multiple processes, writing should not be allowed.

  • cache_filename (str) – The name of the HDF5 file to store the cache in.

  • compression (str or int, optional) – Compression to use for the HDF5 file. Valid values are "gzip", "lzf", "szip", or an integer 0-9 (for gzip compression level). See h5py documentation for details. Example: compression="gzip" or compression=4.

  • *args, **kwargs – Forwarded to the DynamicItem constructor.
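The accepted compression values can be illustrated with a small sketch. The helper normalize_compression below is hypothetical (it is not part of SpeechBrain) and assumes the value ends up as h5py's compression and compression_opts dataset keywords, with a bare integer treated as a gzip level as the parameter description states. Whether the class validates its input this way is not stated in this reference.

```python
def normalize_compression(compression):
    """Hypothetical helper: map a documented value to h5py-style keywords."""
    if compression is None:
        return {}  # no compression requested
    # bool is a subclass of int, so reject it before the integer check.
    if isinstance(compression, bool):
        raise ValueError(f"Unsupported compression: {compression!r}")
    if isinstance(compression, int):
        if not 0 <= compression <= 9:
            raise ValueError("gzip level must be between 0 and 9")
        # A bare integer is read as a gzip compression level.
        return {"compression": "gzip", "compression_opts": compression}
    if compression in ("gzip", "lzf", "szip"):
        return {"compression": compression}
    raise ValueError(f"Unsupported compression: {compression!r}")


print(normalize_compression("gzip"))
print(normalize_compression(4))
```

The returned dictionary could then be unpacked into a dataset-creation call, for example create_dataset(name, data=array, **options) in h5py.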

property hdf5_path

Compute the full path to the HDF5 file from cache_location and cache_filename.
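As a rough illustration, the path can be thought of as a simple join of the two arguments. The standalone function below is a sketch under that assumption; the actual property may build the path differently (for example with os.path.join).

```python
from pathlib import Path


def hdf5_path(cache_location, cache_filename="cache.hdf5"):
    """Sketch: join the cache folder and the cache file name."""
    return Path(cache_location) / cache_filename


print(hdf5_path("results/cache"))
print(hdf5_path("results/cache", "train.hdf5"))
```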

__getstate__()[source]

Get the state of the object for pickling. The HDF5 file must be closed before the object is pickled.

__setstate__(state)[source]

Set the state of the object for unpickling.
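The general pattern behind these two methods can be sketched with a plain text file instead of HDF5. FileBackedStore is a hypothetical stand-in, not SpeechBrain code: it closes and drops its handle before pickling and reopens it lazily afterwards. The real class may differ in its details.

```python
import os
import pickle
import tempfile


class FileBackedStore:
    """Stand-in for an object that holds an open file handle."""

    def __init__(self, path, mode="a"):
        self.path = path
        self.mode = mode
        self._handle = None  # opened lazily on first access

    @property
    def handle(self):
        if self._handle is None:
            self._handle = open(self.path, self.mode)
        return self._handle

    def __getstate__(self):
        # Open file handles cannot be pickled: close and drop the handle.
        if self._handle is not None:
            self._handle.close()
            self._handle = None
        return self.__dict__.copy()

    def __setstate__(self, state):
        # Restore plain attributes; the handle reopens on next access.
        self.__dict__.update(state)


with tempfile.TemporaryDirectory() as tmp:
    store = FileBackedStore(os.path.join(tmp, "cache.txt"))
    store.handle.write("hello")
    clone = pickle.loads(pickle.dumps(store))  # closes the original handle
    clone_starts_closed = clone._handle is None
    clone.handle.write(" world")
    clone.handle.close()
    with open(clone.path) as f:
        contents = f.read()

print(contents)
```

Closing the handle in __getstate__ matters mostly when the object is sent to worker processes, since each worker then opens its own handle rather than sharing one.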

change_file_mode(new_file_mode)[source]

Change the mode in which the HDF5 file is opened. Typically used to switch from a writable mode (building the cache) to a read-only mode (multi-process loading).

classmethod cache(cache_location, file_mode='a', cache_filename='cache.hdf5', compression=None)[source]

Decorator that takes a DynamicItem and creates a CachedHDF5DynamicItem.

Parameters:
  • cache_location (os.PathLike) – Folder in which the HDF5 cache file is stored.

  • file_mode (str) – The mode to use when opening the HDF5 file. When creating the cache, writing must be allowed, but when reading from multiple processes, writing should not be allowed.

  • cache_filename (str) – The name of the HDF5 file to store the cache in.

  • compression (str or int, optional) – The compression to use for the HDF5 file. Accepts the same values as the compression parameter of the constructor.

Example

>>> import numpy
>>> from speechbrain.integrations.hdf5.cached_item import CachedHDF5DynamicItem
>>> from speechbrain.utils.data_pipeline import takes, provides
>>> tempdir = getfixture("tmpdir")
>>> @CachedHDF5DynamicItem.cache(tempdir)
... @takes("id", "text")
... @provides("tokenized")
... def count_to(id, limit):
...     return numpy.arange(limit)
>>> "utt_id" in count_to.hdf5file
False
>>> count_to("utt_id", 5)
array([0, 1, 2, 3, 4])
>>> "utt_id" in count_to.hdf5file
True
>>> # The output shouldn't change on the second call
>>> count_to("utt_id", 5)
array([0, 1, 2, 3, 4])
>>> # NOTE: stale cache entries are not detected. The value cached for this id is returned even though the input changed.
>>> count_to("utt_id", 10)
array([0, 1, 2, 3, 4])
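The behavior in the last call can be reproduced with a dictionary-backed sketch. The decorator cache_by_id is hypothetical and assumes, as the example suggests, that entries are keyed by id only, so a changed input under the same id returns the stored value.

```python
import functools


def cache_by_id(func):
    """Hypothetical decorator: cache results under the first argument only."""
    store = {}

    @functools.wraps(func)
    def wrapper(item_id, *args):
        # Only the id is checked; the other arguments are ignored on a hit.
        if item_id not in store:
            store[item_id] = func(item_id, *args)
        return store[item_id]

    wrapper.store = store
    return wrapper


@cache_by_id
def count_to(item_id, limit):
    return list(range(limit))


first = count_to("utt_id", 5)
stale = count_to("utt_id", 10)    # same id: the cached value is returned
fresh = count_to("other_id", 10)  # new id: the function actually runs
print(first, stale, fresh)
```

If the transformation or its inputs change, the practical consequence is that the cache has to be rebuilt, for example by deleting the cache file or pointing cache_location at a new folder.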