speechbrain.integrations.hdf5.cached_item module

A pipeline for caching data transformations into hdf5 files.

Authors:

Peter Plantinga, 2025
Adel Moumen, 2025

Summary

Classes:

CachedHDF5DynamicItem

CachedDynamicItem that uses HDF5 to store the cache.

Reference

class speechbrain.integrations.hdf5.cached_item.CachedHDF5DynamicItem(cache_location, file_mode='a', cache_filename='cache.hdf5', compression=None, *args, **kwargs)[source]

Bases: CachedDynamicItem

CachedDynamicItem that uses HDF5 to store the cache. HDF5 keeps the whole cache in a single file, which may be faster or more storage-efficient than the default backend (one torch file per id).

Parameters:

cache_location (os.PathLike) – Folder in which the HDF5 cache file is stored.
file_mode (str) – The mode to use when opening the HDF5 file. When creating the cache, writing must be allowed, but when reading from multiple processes, writing should not be allowed.
cache_filename (str) – The name of the HDF5 file to store the cache in.
compression (str or int, optional) – Compression to use for the HDF5 file. Valid values are "gzip", "lzf", "szip", or an integer 0-9 (for gzip compression level). See the h5py documentation for details. Example: compression="gzip" or compression=4.
*args, **kwargs – Forwarded to the DynamicItem constructor.
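
The compression values listed above correspond to h5py's dataset-creation arguments. The following is a minimal sketch of those arguments used directly with h5py, assuming h5py and numpy are installed; the dataset keys and file location are illustrative, and this is not necessarily how CachedHDF5DynamicItem writes its datasets internally.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative only: store one compressed dataset per id in a single HDF5 file.
path = os.path.join(tempfile.mkdtemp(), "cache.hdf5")

with h5py.File(path, "a") as f:
    # gzip with an explicit compression level (0-9) passed via compression_opts
    f.create_dataset(
        "utt_a", data=np.arange(1000), compression="gzip", compression_opts=4
    )
    # lzf takes no level argument
    f.create_dataset("utt_b", data=np.arange(1000), compression="lzf")

with h5py.File(path, "r") as f:
    compression_a = f["utt_a"].compression  # filter name stored with the dataset
    compression_b = f["utt_b"].compression
    restored = f["utt_a"][()]  # decompression is transparent on read
```

Compression is chosen per dataset at creation time, so reading the cache back requires no extra arguments.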

property hdf5_path: Compute the full path to the HDF5 file from cache_location and cache_filename.

__getstate__()[source]: Get the state of the object for pickling. In case of pickling, we need to close the HDF5 file.

__setstate__(state)[source]: Set the state of the object for unpickling.
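
The close-before-pickling pattern described for __getstate__ and __setstate__ can be sketched with a plain file handle. This is a hypothetical stand-in, not the SpeechBrain implementation: the class name FileBackedCache and its attributes are assumptions made for illustration, and an ordinary text file takes the place of the HDF5 file.

```python
import os
import pickle
import tempfile


class FileBackedCache:
    """Hypothetical stand-in for an object that owns an open file handle."""

    def __init__(self, path, mode="a"):
        self.path = path
        self.mode = mode
        self._handle = None  # opened lazily on first access

    @property
    def handle(self):
        if self._handle is None:
            self._handle = open(self.path, self.mode)
        return self._handle

    def __getstate__(self):
        # Open handles cannot be pickled: close it and drop it from the state.
        if self._handle is not None:
            self._handle.close()
            self._handle = None
        return self.__dict__.copy()

    def __setstate__(self, state):
        # Restore plain attributes; the handle is reopened on next access.
        self.__dict__.update(state)


path = os.path.join(tempfile.mkdtemp(), "cache.bin")
cache = FileBackedCache(path)
cache.handle.write("hello")

# Pickling closes the original handle; the copy reopens its own on demand.
clone = pickle.loads(pickle.dumps(cache))
clone_reopened = not clone.handle.closed
clone.handle.close()
```

This is the situation that arises when a dataset holding the cached item is sent to worker processes: each worker receives only the path and mode, then opens its own handle.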

change_file_mode(new_file_mode)[source]: Change the mode that the HDF5 file is opened with. Typically used to switch from a writable mode (building the cache) to a read-only mode (multi-process loading).
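
The write-then-read-only workflow can be illustrated directly with h5py. This sketch assumes h5py and numpy are installed and only shows the underlying file modes; it does not call change_file_mode itself, and the dataset key is illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "cache.hdf5")

# Build phase: "a" opens read/write and creates the file if it is missing.
f = h5py.File(path, "a")
f.create_dataset("utt_id", data=np.arange(5))
build_mode = f.mode  # h5py reports "r+" for any writable handle
f.close()

# Load phase: reopen read-only before handing the file to several readers.
f = h5py.File(path, "r")
load_mode = f.mode
cached = f["utt_id"][()].tolist()
f.close()
```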

classmethod cache(cache_location, file_mode='a', cache_filename='cache.hdf5', compression=None)[source]

Decorator which takes a DynamicItem and creates a CachedHDF5DynamicItem.

Parameters:

cache_location (os.PathLike) – Folder in which the HDF5 cache file is stored.
file_mode (str) – The mode to use when opening the HDF5 file. When creating the cache, writing must be allowed, but when reading from multiple processes, writing should not be allowed.
cache_filename (str) – The name of the HDF5 file to store the cache in.
compression (str or int, optional) – Compression to use for the HDF5 file. Accepts the same values as the compression parameter of the class constructor.

Example

>>> import os, numpy
>>> from speechbrain.utils.data_pipeline import takes, provides
>>> from speechbrain.integrations.hdf5.cached_item import CachedHDF5DynamicItem
>>> tempdir = getfixture("tmpdir")
>>> @CachedHDF5DynamicItem.cache(tempdir)
... @takes("id", "text")
... @provides("tokenized")
... def count_to(id, limit):
...     return numpy.arange(limit)
>>> "utt_id" in count_to.hdf5file
False
>>> count_to("utt_id", 5)
array([0, 1, 2, 3, 4])
>>> "utt_id" in count_to.hdf5file
True
>>> # The output shouldn't change on the second call
>>> count_to("utt_id", 5)
array([0, 1, 2, 3, 4])
>>> # NOTE: no invalid-cache detection: new arguments for a cached id still return the cached value
>>> count_to("utt_id", 10)
array([0, 1, 2, 3, 4])
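
The final call in the example returns the five-element array because the lookup appears to be keyed on the id alone. The toy sketch below reproduces that behaviour with an in-memory dict instead of an HDF5 file; the decorator cache_by_id is hypothetical and not part of SpeechBrain.

```python
import functools


def cache_by_id(func):
    """Hypothetical decorator: cache results under the first argument only."""
    store = {}

    @functools.wraps(func)
    def wrapper(item_id, *args):
        # Only the id is checked; the remaining arguments are ignored on a hit.
        if item_id not in store:
            store[item_id] = func(item_id, *args)
        return store[item_id]

    wrapper.store = store
    return wrapper


@cache_by_id
def count_to(item_id, limit):
    return list(range(limit))


first = count_to("utt_id", 5)
stale = count_to("utt_id", 10)    # same id: the cached value is returned
fresh = count_to("other_id", 10)  # new id: the function actually runs
```

If the transformation or its inputs change, a cache built this way has to be cleared by hand, for example by deleting the cache file or pointing cache_location at a new folder.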