speechbrain.integrations.hdf5.cached_item module
A pipeline for caching data transformations into HDF5 files.
- Authors:
Peter Plantinga, 2025
Adel Moumen, 2025
Summary
Classes:
CachedHDF5DynamicItem | CachedDynamicItem that uses HDF5 to store the cache.
Reference
- class speechbrain.integrations.hdf5.cached_item.CachedHDF5DynamicItem(cache_location, file_mode='a', cache_filename='cache.hdf5', compression=None, *args, **kwargs)
Bases: CachedDynamicItem
CachedDynamicItem that uses HDF5 to store the cache. HDF5 keeps the whole cache in a single file, which may be faster or more efficient than the default storage (one torch file per id).
- Parameters:
cache_location (os.PathLike) – Folder in which the HDF5 cache file is stored.
file_mode (str) – The mode to use when opening the HDF5 file. Writing must be allowed while the cache is being created, but should not be allowed when the file is read from multiple processes.
cache_filename (str) – The name of the HDF5 file to store the cache in.
compression (str or int, optional) – Compression to use for the HDF5 file. Valid values are "gzip", "lzf", "szip", or an integer 0-9 (for gzip compression level). See the h5py documentation for details. Example: compression="gzip" or compression=4.
*args, **kwargs – Forwarded to the DynamicItem constructor.
- property hdf5_path
Compute the full path to the HDF5 file from cache_location and cache_filename.
- __getstate__()
Get the state of the object for pickling. The HDF5 file is closed before pickling, because an open file handle cannot be pickled.
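The pattern described here can be sketched with the standard library alone. This is a minimal illustration, not the SpeechBrain implementation: `LazyFileHolder` is a hypothetical class, and a plain text file stands in for the HDF5 file.

```python
import os
import pickle
import tempfile


class LazyFileHolder:
    """Hypothetical stand-in for an object that keeps a file handle open."""

    def __init__(self, path, mode="a"):
        self.path = path
        self.mode = mode
        self._handle = None  # opened lazily on first access

    @property
    def handle(self):
        if self._handle is None:
            self._handle = open(self.path, self.mode)
        return self._handle

    def __getstate__(self):
        # Close the handle (flushing pending writes) so that the pickled
        # state carries only the path and mode, not an OS resource.
        if self._handle is not None:
            self._handle.close()
            self._handle = None
        return self.__dict__.copy()


path = os.path.join(tempfile.mkdtemp(), "cache.txt")
holder = LazyFileHolder(path)
holder.handle.write("x")

# Pickling triggers __getstate__; the copy reopens the file on demand.
clone = pickle.loads(pickle.dumps(holder))
clone.handle.write("y")
clone.handle.close()
```

Because the handle is reopened lazily, each worker process that receives an unpickled copy ends up with its own file handle.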
- change_file_mode(new_file_mode)
Change the mode in which the HDF5 file is opened. Typically used to switch from a writable mode (building the cache) to read-only mode (multi-process loading).
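The build-then-read workflow can be sketched as follows. This assumes the mode change amounts to closing and reopening the file, which the documentation does not confirm; `ModalFile` is a hypothetical class and a plain text file stands in for the HDF5 file.

```python
import os
import tempfile


class ModalFile:
    """Hypothetical holder whose file can be reopened in another mode."""

    def __init__(self, path, file_mode="a"):
        self.path = path
        self.file_mode = file_mode
        self._handle = open(path, file_mode)

    def change_file_mode(self, new_file_mode):
        # Close first so pending writes are flushed, then reopen.
        self._handle.close()
        self.file_mode = new_file_mode
        self._handle = open(self.path, new_file_mode)

    def close(self):
        self._handle.close()


path = os.path.join(tempfile.mkdtemp(), "cache.txt")

# Phase 1: build the cache in a writable mode.
item = ModalFile(path, file_mode="a")
item._handle.write("cached")

# Phase 2: switch to read-only before handing the object to readers.
item.change_file_mode("r")
contents = item._handle.read()

try:
    item._handle.write("more")
    write_allowed = True
except OSError:  # io.UnsupportedOperation is a subclass of OSError
    write_allowed = False
item.close()
```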
- classmethod cache(cache_location, file_mode='a', cache_filename='cache.hdf5', compression=None)
Decorator which takes a DynamicItem and creates a CachedHDF5DynamicItem.
- Parameters:
cache_location (os.PathLike) – Folder in which the HDF5 cache file is stored.
file_mode (str) – The mode to use when opening the HDF5 file. Writing must be allowed while the cache is being created, but should not be allowed when the file is read from multiple processes.
cache_filename (str) – The name of the HDF5 file to store the cache in.
compression (str) – The compression algorithm to use for the HDF5 file.
Example
>>> import os, numpy
>>> from speechbrain.integrations.hdf5.cached_item import CachedHDF5DynamicItem
>>> from speechbrain.utils.data_pipeline import takes, provides
>>> tempdir = getfixture("tmpdir")
>>> @CachedHDF5DynamicItem.cache(tempdir)
... @takes("id", "text")
... @provides("tokenized")
... def count_to(id, limit):
...     return numpy.arange(limit)
>>> "utt_id" in count_to.hdf5file
False
>>> count_to("utt_id", 5)
array([0, 1, 2, 3, 4])
>>> "utt_id" in count_to.hdf5file
True
>>> # The output shouldn't change on the second call
>>> count_to("utt_id", 5)
array([0, 1, 2, 3, 4])
>>> # NOTE: NO INVALID CACHE DETECTION
>>> count_to("utt_id", 10)
array([0, 1, 2, 3, 4])
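The last call in the example returns the old result because the cache is looked up by id only. The sketch below reproduces that behavior with a plain dictionary. It is an illustration under that assumption, not the library's code: `cached_by_id` and the module-level `cache` dict are hypothetical names.

```python
cache = {}  # hypothetical in-memory stand-in for the HDF5 file


def cached_by_id(func):
    """Cache results under the first argument (the id) only."""

    def wrapper(uid, *args):
        if uid not in cache:
            cache[uid] = func(uid, *args)
        # The other arguments are never compared against the stored entry.
        return cache[uid]

    return wrapper


@cached_by_id
def count_to(uid, limit):
    return list(range(limit))


first = count_to("utt_id", 5)   # computed and stored
stale = count_to("utt_id", 10)  # same id, so the stored result is reused

# In this sketch, dropping the entry forces a recomputation.
del cache["utt_id"]
refreshed = count_to("utt_id", 10)
```

In practice this means that when the inputs or the transformation change, the stale cache has to be cleared by the user, for example by removing the cache file or pointing to a new cache location.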