Open In Colab to execute or view/download this notebook on GitHub

Data Loading

Handling data often consumes the majority of the active work time in a machine learning project.

SpeechBrain complements standard PyTorch data loading in handling variable-length sequences, large datasets, and complex data transformation pipelines. These are typical challenges when working with speech, but SpeechBrain tries not to make assumptions about your data.

Install dependencies

%%capture
# Installing SpeechBrain via pip
BRANCH = 'develop'
!python -m pip install git+https://github.com/speechbrain/speechbrain.git@$BRANCH
import speechbrain
import torch

In this tutorial we will use Mini-LibriSpeech (https://www.openslr.org/resources/31). In the following two cells we download its validation set, together with the images and scripts used in this tutorial.

%%capture
# here we download the material needed for this tutorial: images and an example based on mini-librispeech
!wget https://www.dropbox.com/s/b61lo6gkpuplanq/MiniLibriSpeechTutorial.tar.gz?dl=0
!tar -xvzf MiniLibriSpeechTutorial.tar.gz?dl=0
# downloading mini_librispeech dev data
!wget https://www.openslr.org/resources/31/dev-clean-2.tar.gz
!tar -xvzf dev-clean-2.tar.gz

Preface: PyTorch data loading pipeline

SpeechBrain data-IO follows and extends PyTorch data loading. This preface section recaps PyTorch data loading and does not yet consider the SpeechBrain data loading extensions.

Overview

PyTorch data loading can run in many configurations, but a typical approach has these basic elements:

  • a Dataset, which loads data points one-at-a-time.

  • a collation function, or collate_fn for short, which takes a list of data points and forms a batch.

  • a Sampler, which determines the order in which the Dataset is iterated.

  • a DataLoader, which combines the elements above (and has defaults for collate_fn and Sampler), and orchestrates the whole pipeline.

Dataset

The role of the Dataset is to produce single data points. Typically they are loaded off the disk, but they could also come from some more complex source or in some cases just from RAM. You can write your own Dataset subclass or sometimes you can use a standardized class. The training, validation, and test subsets get their own Dataset instances.

The Dataset interface is simple; it implements __getitem__ and usually also __len__. Usually, “map-style” Datasets are used, but it’s worth noting that PyTorch also has a notion of IterableDatasets.

__getitem__ can return anything because data can look like anything. Often, though, a data point consists of multiple associated things (e.g. an image and a label, or a speech waveform and its transcription). The Dataset should return all of these associated things.

It is also relatively common for the Dataset to somehow transform the data on the fly, on the CPU.
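For illustration, a minimal map-style Dataset might look like the sketch below. The class name and the file/transcript lists are hypothetical placeholders, not part of SpeechBrain or PyTorch:

import torchaudio
from torch.utils.data import Dataset

class AudioFileDataset(Dataset):
    """Toy map-style Dataset returning one (waveform, transcript) pair per index."""
    def __init__(self, file_paths, transcripts):
        self.file_paths = file_paths    # list of audio file paths
        self.transcripts = transcripts  # list of matching transcript strings

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index):
        # load the waveform off disk on the fly (on the CPU)
        signal, sample_rate = torchaudio.load(self.file_paths[index])
        return signal.squeeze(0), self.transcripts[index]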

Collation function

The collate_fn just converts a list of examples into a PyTorch tensor batch. If the data has variable-length sequences, collate_fn usually needs to implement padding.
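As a rough sketch, a padding collate_fn for variable-length 1-D signals could look as follows. The names are illustrative; SpeechBrain's own PaddedBatch, presented later, does this for you:

import torch

def padding_collate_fn(examples):
    """Pad a list of (signal, transcript) pairs into a single right-padded batch."""
    signals = [sig for sig, _ in examples]
    transcripts = [text for _, text in examples]
    max_len = max(sig.shape[0] for sig in signals)
    batch = torch.zeros(len(signals), max_len)
    lengths = torch.zeros(len(signals), dtype=torch.long)
    for i, sig in enumerate(signals):
        batch[i, : sig.shape[0]] = sig  # right-pad with zeros
        lengths[i] = sig.shape[0]
    return batch, lengths, transcripts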

Sampler

Typically users don’t need to create their own sampler; the two default options are to iterate in the original Dataset order or to iterate in random order.
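For example, the two defaults correspond roughly to the snippet below (the toy list stands in for a real Dataset; passing shuffle=True/False to the DataLoader achieves the same thing):

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

toy_data = list(range(10))  # any object with __getitem__/__len__ works as a map-style dataset

# shuffle=False corresponds to a SequentialSampler (original order),
# shuffle=True to a RandomSampler (new random order every epoch).
sequential_loader = DataLoader(toy_data, sampler=SequentialSampler(toy_data), batch_size=4)
random_loader = DataLoader(toy_data, sampler=RandomSampler(toy_data), batch_size=4)
print(list(sequential_loader))
print(list(random_loader))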

DataLoader

The DataLoader takes the other elements above and many other arguments such as batch size. The DataLoader has basic defaults for all arguments (except the Dataset, of course), but it’s worthwhile to understand the args.

The DataLoader object is iterated in the training loop, and every Dataset instance (e.g. training, validation, test) gets its own DataLoader.

# illustrative snippet: train_data, PaddedBatch, and model are assumed to be defined elsewhere
train_loader = DataLoader(train_data, collate_fn=PaddedBatch, batch_size=32, num_workers=2)
for batch in train_loader:
    pred = model(batch.signal)
    ...

The iterator that DataLoader returns can either load batches in the same process that it is created in (num_workers=0), or it can start a new process (num_workers=1) or multiple new processes (num_workers>1). Because of the Global Interpreter Lock, a single Python process cannot execute Python code on two threads at the same time. Using at least one background worker process to load data while training runs is therefore often essential for taking full advantage of GPU compute resources.

SpeechBrain Basic dataIO

The SpeechBrain basic data loading pipeline is illustrated in the figure below.

from IPython.display import Image
Image('sbdataio.png', width=1000)

Note that more advanced and flexible pipelines are also possible, where users can integrate their own custom Datasets, DataLoaders, and Samplers. These are illustrated in the advanced dataIO tutorial.

The basic dataIO pipeline is organized around three key blocks, which are tightly connected: DynamicItemDataset, Dynamic Item Pipelines (DIPs), and CategoricalEncoder.

DynamicItemDataset inherits from torch.utils.data.Dataset and has been built to work together with Dynamic Item Pipelines, providing a straightforward and flexible way to fetch and transform data and labels from the raw dataset stored on disk.

DIPs consist of user-defined functions in which the user specifies the operations applied to the data and metadata contained in the dataset, e.g. reading and augmenting an audio file, or encoding a sequence of words with a SentencePiece tokenizer. These functions are called inside the DynamicItemDataset __getitem__ method and run on the CPU, in parallel across DataLoader workers.

The CategoricalEncoder is a convenient abstraction we provide for multi-class classification problems. It is sub-classed into TextEncoder and CTCTextEncoder, which can be used for sequence-to-sequence applications on text, such as ASR.

Thanks to these abstractions, most of the work needed to set up the data IO pipeline is parsing the dataset into an annotation format supported by DynamicItemDataset (SpeechBrain supports both CSV and JSON).

Once this annotation is ready, a flexible and efficient pipeline can be created in a few lines of code, as SpeechBrain takes care of padding and other operations under the hood.

In the rest of this tutorial, we explain in detail how these blocks work. We start from the required CSV or JSON annotation, whose purpose is to represent and describe the information contained in the dataset. For example:

  • Paths to audio files, pre-extracted features et cetera.

  • Metadata such as words spoken in the audio files, Signal-to-Noise-Ratio, sound event tags, speaker identities et cetera.

In short, any information required to train your algorithm.

Dataset Annotation

SpeechBrain offers native support for the JSON and CSV formats for describing a dataset; in fact, in official recipes (such as the LibriSpeech ASR recipes) we provide parsing scripts that produce such files.

We can take a glimpse at how these files can be structured using the downloaded Mini-LibriSpeech example.

Each file in Mini-LibriSpeech is an utterance from a single speaker, so either the JSON or the CSV format can be used to store the absolute path to that file, the speaker identity, and the words the speaker utters. This is enough to build an Automatic Speech Recognition (ASR) system.

Below we can take a look at how the JSON file used for the simple speaker recognition example of this tutorial is structured:

from parse_data import parse_to_json # parse_data is a local library downloaded at the step before
parse_to_json("./LibriSpeech/dev-clean-2") # how to build this function will be shown later
!head data.json

The JSON file is thus a dictionary where each key is unique and corresponds to a single example (the key is the example ID). Each example is, in turn, a dictionary containing the path to the utterance (file_path), the words spoken (words), the identity of the speaker (spkID), and the length in samples of the audio file (length).
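In other words, a single entry looks roughly like this (the ID, path, and values below are made up purely for illustration):

{
  "spk1-chapter1-utt1": {
    "file_path": "/path/to/LibriSpeech/dev-clean-2/spk1/chapter1/spk1-chapter1-utt1.flac",
    "words": "HELLO WORLD",
    "spkID": "spk1",
    "length": 33440
  }
}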

The CSV file has an equivalent structure:

# csv file
from parse_data import parse_to_csv
parse_to_csv("./LibriSpeech/dev-clean-2")
!head data.csv

Unlike other toolkits, which impose rather strict requirements on how a dataset must be specified in order to process it, we do not impose any restriction on the JSON and CSV syntax except that every single example must have its own unique ID string.

This means that the JSON file must contain a dictionary whose keys are the example IDs, with each entry of the dictionary containing the metadata for that example. CSV files, instead, must have at least one column called id.

These are the only strict requirements to guarantee that the JSON and CSV dataset description files will work with SpeechBrain data IO pipeline.
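For illustration, an equivalent (made-up) CSV could look like this, with one row per example ID:

id,file_path,words,spkID,length
spk1-chapter1-utt1,/path/to/LibriSpeech/dev-clean-2/spk1/chapter1/spk1-chapter1-utt1.flac,HELLO WORLD,spk1,33440
spk2-chapter3-utt7,/path/to/LibriSpeech/dev-clean-2/spk2/chapter3/spk2-chapter3-utt7.flac,GOOD MORNING,spk2,51200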

Users are given great flexibility in how to represent their datasets in the JSON and CSV files, as their goals and applications may differ: speech separation, enhancement, ASR, diarization, VAD, et cetera.

This is because SpeechBrain targets many different tasks and datasets:

  • Every dataset is unique: it can be single-channel or multi-channel, and it can provide different metadata such as speaker IDs, speaker positions, or even multi-modal data such as audio and video.

  • Every task is unique: the annotation used in this example is suitable for applications like ASR and speaker recognition, but for diarization, for example, the user would also want the start and stop of each utterance (in seconds, frames, or any other unit).

This also keeps the annotation simple and focused on the particular task, with only the information necessary for the current application, instead of a cumbersome do-it-all annotation.

TIP

When building the parsing script, it is useful to include a length or duration entry for each example, expressed in seconds, samples, or even frames. This enables subsequent operations such as filtering out examples that are too long (to avoid OOM issues) or sorting them for faster training. Regarding CSV files, if a duration column is specified it is automatically cast to float when a DynamicItemDataset is built from the CSV.

As mentioned, these files must be produced by a suitable parsing script, which is provided in our recipes and depends on the dataset and the task. However, if one wishes to use a custom dataset, a parsing script must be written in order to produce the JSON or CSV files describing the data.

NOTE

A separate file for each split (training, validation/development, test) must be provided.

On the other hand, creating those files should be rather easy for most datasets, as Python offers many tools for manipulating CSV and JSON files (such as pandas). In fact, parsing the data to JSON can be done in a few lines of code:

import glob
import json
import os
import torchaudio
from pathlib import Path

dev_clean_root = "./LibriSpeech/dev-clean-2"
flac_files = glob.glob(os.path.join(dev_clean_root, "**/*.flac"), recursive=True)
print("tot flac audio files {}".format(len(flac_files)))

flac_files[0]
# in this dataset, file names are spk_id-chapter_id-utterance_id.flac
text_files = glob.glob(os.path.join(dev_clean_root, "**/*.txt"), recursive=True)
# we build a dictionary with words for each utterance
words_dict = {}
for txtf in text_files:
  with open(txtf, "r") as f:
    lines = f.readlines()
  for l in lines:
    l = l.strip("\n")
    utt_id = l.split(" ")[0]
    words = " ".join(l.split(" ")[1:])
    words_dict[utt_id] = words

# we now build JSON examples

examples = {}
for utterance in flac_files:
  utt_id = Path(utterance).stem
  examples[utt_id] = {"file_path": utterance,
                      "words": words_dict[utt_id],
                      "spkID": utt_id.split("-")[0],
                      "length": torchaudio.info(utterance).num_frames}


with open("data.json", "w") as f:
  json.dump(examples, f, indent=4)

print(examples[list(examples.keys())[0]])

DynamicItemDataset

DynamicItemDataset is at the heart of SpeechBrain data pipeline and is built on top of torch.utils.data.Dataset.

As the name implies, it allows the dynamic creation of new “objects” from the entries specified in the JSON (or CSV) dataset annotation.

# creating a DynamicItemDataset instance from JSON or CSV annotation is immediate
from speechbrain.dataio.dataset import DynamicItemDataset

dataset = DynamicItemDataset.from_json("data.json") # or equivalently, DynamicItemDataset.from_csv("data.csv")

What does the dynamic creation of “objects” from the entries specified in the data.json annotation mean?

dataset[0]

As it is now, this Dataset object does not return anything.

Dynamic creation means exactly that: the items the user wants returned must be specified in some way by the user. These items can depend on the entries specified in the data.json examples:

print(examples[list(examples.keys())[0]].keys())

Namely, on ['file_path', 'words', 'spkID', 'length'].

For example, one “dynamic item” could be the audio signal, which depends on the 'file_path' key. Another could be spkID encoded to an integer value, if one wishes to perform speaker recognition, or, for ASR, the words encoded by a tokenizer.

To obtain these “items”, one should specify in some way a function which, when applied to the corresponding key, provides the new item: e.g. a function that reads the audio when applied to the 'file_path' key, so that the Dataset returns the audio signal for each example.

Dynamic Item Pipelines (DIPs)

This is handled by specifying a Dynamic Item Pipeline for each dynamic item the user wants to get from the dataset. The user can specify an arbitrary number of pipelines.

For example, regarding the audio signal:

@speechbrain.utils.data_pipeline.takes("file_path")
@speechbrain.utils.data_pipeline.provides("signal")
def audio_pipeline(file_path):
      sig = speechbrain.dataio.dataio.read_audio(file_path)
      return sig

We specify a function that takes the file_path of each example and provides a new item called signal, which is a tensor containing the audio.

Here we use a pre-built function from speechbrain.dataio.dataio for reading audio, but users can also use their own.

Once specified, the pipeline must be added to the DynamicItemDataset object and, afterwards, the outputs requested by the user should be specified with the set_output_keys method. We request two items in the output: the new signal item, and file_path, which is in the JSON annotation.

dataset.add_dynamic_item(audio_pipeline)
dataset.set_output_keys(["signal", "file_path"])
dataset[0]

Note that a more compact syntax can be used for applying a simple function like reading the audio file.

dataset.add_dynamic_item(speechbrain.dataio.dataio.read_audio, takes="file_path", provides="signal")
dataset.set_output_keys(["id", "signal", "words"])
dataset[0]

Now the dataset object will return the newly specified signal item as well as the items taken directly from the JSON annotation:

print(dataset[0])
import matplotlib.pyplot as plt
plt.figure(1)
plt.title("Sig item")
plt.plot(dataset[0]["signal"])
plt.show()

It can be seen that the DynamicItemDataset object returns the requested items.

Items taken directly from the JSON annotation (such as file_path or words) are returned as they are, without further processing, whereas signal is a new item derived from file_path by the pipeline we defined before.

There are no constraints on what can be done in the pipelines: as said, users can also use their own functions:

!pip install soundfile
import soundfile as sf
@speechbrain.utils.data_pipeline.takes("file_path")
@speechbrain.utils.data_pipeline.provides("sig_numpy")
def audio_pipeline_numpy(file_path):
      sig, _ = sf.read(file_path, dtype="float32")
      return sig
speechbrain.dataio.dataset.add_dynamic_item([dataset], audio_pipeline_numpy)
speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["signal", "file_path", "sig_numpy"],
    )
dataset[0]

The dataset object now also returns the signal as read with the soundfile library, in addition to the one read with the built-in SpeechBrain function, which is based on torchaudio.

Multiple outputs can be specified by one pipeline by using Python generator syntax.

In the example below, three outputs are specified; the last two depend directly on the first one (sig) and are transformed versions of it: one with a random gain factor (rand_gain_sig) and one with a constant offset (offset_sig).

import random
@speechbrain.utils.data_pipeline.takes("file_path")
@speechbrain.utils.data_pipeline.provides("sig", "rand_gain_sig", "offset_sig")
def audio_pipeline(file_path):
      sig = speechbrain.dataio.dataio.read_audio(file_path)
      yield sig
      rand_gain_sig = random.random()*sig
      yield rand_gain_sig
      offset_sig = sig + 1
      yield offset_sig


speechbrain.dataio.dataset.add_dynamic_item([dataset], audio_pipeline)
speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["sig", "rand_gain_sig", "offset_sig"],
    )

print(dataset[0])
import matplotlib.pyplot as plt
plt.figure(1)
plt.title("Sig item")
plt.plot(dataset[0]["sig"])

plt.title("Sig item with random gain")
plt.plot(dataset[0]["rand_gain_sig"])

plt.title("Sig item offset")
plt.plot(dataset[0]["offset_sig"])
plt.legend(["sig", "rand_gain_sig", "offset_sig"])
plt.show()

This toy example demonstrates that multiple items can be fetched from the same pipeline and dynamically created items can depend on other dynamically created items (offset_sig depends on sig).

But dynamic items can also depend on dynamically created items from another, previously specified pipeline:

@speechbrain.utils.data_pipeline.takes("sig")
@speechbrain.utils.data_pipeline.provides("sig_as_python_list")
def to_list_pipeline(sig):
      yield sig.numpy().tolist()


speechbrain.dataio.dataset.add_dynamic_item([dataset], to_list_pipeline)
speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["sig_as_python_list"],
    )
dataset[0]["sig_as_python_list"][:10]

In this example, we have defined a new pipeline which takes sig and turns it from a torch.Tensor into a Python list, obtaining a new dynamic item, sig_as_python_list.

NOTE

Since we request only sig_as_python_list in the output, which itself depends on sig, the dynamic items offset_sig and rand_gain_sig are not computed at all. Only sig is computed implicitly, as it is necessary to obtain sig_as_python_list.

In fact, under the hood, DynamicItemDataset finds a suitable evaluation order for the requested items by constructing a computational graph defined by the pipelines.

An error is raised if any circular dependency is present between the pipelines.

A DIP can also take multiple items/annotation keys as input; the syntax is the same as for the output items:

@speechbrain.utils.data_pipeline.takes("file_path", "spkID")
@speechbrain.utils.data_pipeline.provides("sig", "spkidstring")
def multiple_dip(file_path, spkID):
      sig = speechbrain.dataio.dataio.read_audio(file_path)
      yield sig
      yield spkID

speechbrain.dataio.dataset.add_dynamic_item([dataset], multiple_dip)
speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["sig", "spkidstring"],
    )
dataset[0]
@speechbrain.utils.data_pipeline.takes("file_path", "spkID")
@speechbrain.utils.data_pipeline.provides("sig")
def multiple_dip(file_path, spkID):
      sig = speechbrain.dataio.dataio.read_audio(file_path)
      yield sig, spkID


speechbrain.dataio.dataset.add_dynamic_item([dataset], multiple_dip)
speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["sig"],
    )
dataset[0] # sig now is a tuple

The same DIP can also be used with multiple datasets. E.g. you usually want the audio-reading DIP to be the same for training, validation, and test:

validation = DynamicItemDataset.from_json("data.json")
train = DynamicItemDataset.from_json("data.json")

speechbrain.dataio.dataset.add_dynamic_item([validation, train], speechbrain.dataio.dataio.read_audio, takes="file_path", provides="signal")
speechbrain.dataio.dataset.set_output_keys([validation, train], ["id", "signal", "words"])
validation[0]
train[0]

CategoricalEncoder

SpeechBrain dataio provides a CategoricalEncoder class for encoding labels that belong to a discrete set: e.g. for speaker recognition or any other multi-class classification problem.

Given a collection of hashables (e.g. strings), it encodes every unique item to an integer value: ["spk0", "spk1"] -> [0, 1]

Internally, the correspondence between each label and its index is handled by two dictionaries: lab2ind and ind2lab.

It is built to tightly integrate with DynamicItemDataset and dataIO pipelines.
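To get a feel for it in isolation, here is a minimal sketch that fits the encoder from a plain Python list (this assumes the update_from_iterable and encode_label methods of CategoricalEncoder):

from speechbrain.dataio.encoder import CategoricalEncoder

enc = CategoricalEncoder()
enc.update_from_iterable(["spk0", "spk1", "spk0"])  # fit on any iterable of labels
print(enc.lab2ind)               # e.g. {"spk0": 0, "spk1": 1}
print(enc.encode_label("spk1"))  # integer index assigned to "spk1"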

For example, one can obtain the encoding for speaker identities (spkID in the JSON) of our Mini-LibriSpeech dataset by creating an instance of CategoricalEncoder and fitting it to the dataset object.

from speechbrain.dataio.encoder import CategoricalEncoder
spk_id_encoder = CategoricalEncoder()

Since the DynamicItemDataset does not currently return spkID, we first have to set its output keys so that it returns that item:

speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["spkID"],
    )
# REMEMBER: no need to specify a pipeline for spkID, as we can read its value directly from the JSON.
dataset[0]

The speaker identity spkID is a string.

Note that in LibriSpeech it is a string containing a unique integer, so one could argue that performing the encoding here is pointless, as casting to integer would suffice.

However, in other datasets it may not be a unique integer but a unique string such as spk1, spk2, et cetera.

spk_id_encoder can be used for this purpose. We fit the encoder to the dataset and specify which dynamic item we want to encode:

spk_id_encoder.update_from_didataset(dataset, "spkID")

NOTE

This will iterate over the dataset, fetch spkID for each example, and construct the internal dicts lab2ind and ind2lab.

Because of this, it is important to call dataset.set_output_keys before fitting the encoder so that computationally costly dynamic items are avoided (as can happen, for example, if a pipeline performs data augmentation).

Setting only the key on which the encoder will be fitted is a good approach.

We can now see how many unique speaker IDs there are in the dataset by using __len__:

len(spk_id_encoder)

We can also take a look at the encoder's internal dictionaries, lab2ind and ind2lab, which contain the mappings between the labels (speaker IDs in this case) and the corresponding integer encodings:

spk_id_encoder.lab2ind  # contains label --> integer encoding
spk_id_encoder.ind2lab  # contains integer encoding --> label

Once fitted, the CategoricalEncoder object can be used in a suitably defined pipeline to encode the spkID key and return the encoded value:

@speechbrain.utils.data_pipeline.takes("spkID")
@speechbrain.utils.data_pipeline.provides("spkid_encoded")
def spk_id_encoding(spkid):
      return torch.LongTensor([spk_id_encoder.encode_label(spkid)])
speechbrain.dataio.dataset.add_dynamic_item([dataset], spk_id_encoding)
speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["spkid_encoded"],
    )

dataset[100]

PaddedBatch and SaveableDataLoader

SpeechBrain offers a convenient way to automatically right-pad tensors of different lengths along multiple dimensions. This is achieved using the PaddedBatch class defined in speechbrain.dataio.batch.

PaddedBatch is both a collate_fn and a batch object.

When a torch.utils.data.Dataset (and thus also a DynamicItemDataset) is passed to the Brain class, PaddedBatch is used as the default collate function (collate_fn) and examples are automatically padded together.

As its default DataLoader, the Brain class instantiates a SpeechBrain custom DataLoader: speechbrain.dataio.dataloader.SaveableDataLoader.

This DataLoader is identical to the plain one except that it allows for intra-epoch saving: if for some reason training stops in the middle of an epoch, it is possible to resume from exactly that step (see the Checkpointing tutorial). The default collate_fn for this DataLoader is PaddedBatch.

from speechbrain.dataio.dataloader import SaveableDataLoader
from speechbrain.dataio.batch import PaddedBatch

speechbrain.dataio.dataset.set_output_keys(
        [dataset], ["id", "spkid_encoded", "signal"],
    )

As output we request the dynamic item signal, which is the audio tensor, the speaker ID encoded with the CategoricalEncoder object defined before (spkid_encoded), and the example id.

dataloader = SaveableDataLoader(dataset, batch_size=2, collate_fn=PaddedBatch)
batch_obj = next(iter(dataloader)) # let's look at the batch obj
batch_obj # the dataloader returns an PaddedBatch obj now
type(batch_obj)

Dynamic Items can be accessed in the batch object by using dict syntax:

batch_obj["spkid_encoded"]
batch_obj["signal"]
batch_obj["id"] # example ids in this batch useful for debugging

As said, all elements of a PaddedBatch which are torch.Tensors are padded together by adding zeros to the right. When these elements are accessed, a namedtuple is returned containing the actual padded tensor and a length tensor.

wav_data, length = batch_obj["signal"]

As it is a namedtuple, the two items, data and lengths, are also accessible as attributes:

lengths = batch_obj["signal"].lengths
wav_data = batch_obj["signal"].data
lengths

This length tensor contains the relative true length of each sequence. In this example it means that the second example in the batch has not been padded (relative length == 1), while the first has instead been padded to more than twice its length.

The use of relative lengths instead of absolute indexes guarantees that these values do not change after feature extraction: the relative true length remains the same even after an STFT, whatever window is used.
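As a quick sketch of what this means in practice, the same relative lengths can be reused after computing features (here assuming the Fbank front-end used later in this tutorial):

from speechbrain.lobes.features import Fbank

fbank = Fbank()                     # default 40-dimensional filterbank features
feats = fbank(wav_data)             # shape: [batch, n_frames, n_mels]
abs_frames = (lengths * feats.shape[1]).long()  # true number of frames per example
abs_frames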

The absolute indexes are easy to obtain:

abs_lens = (lengths*wav_data.shape[1]).long()
abs_lens
wav_data[0][:abs_lens[0]] # no zeros here
wav_data[0][abs_lens[0]:] # the padding zeros begin at abs_lens[0]

The PaddedBatch object also allows for conveniently moving all dynamic items which are torch.Tensors to the right device using .to():

batch_obj = batch_obj.to("cpu")

Of course, items which are not tensors, such as id, are not moved and not padded; they are simply returned as a list.

batch_obj["id"]

It is also possible to iterate over the examples of PaddedBatch:

for ex in batch_obj:
  print(ex)

And access a single example by its position:

batch_obj.at_position(1)

These methods can be conveniently used in the Brain class compute_forward and compute_objectives methods, as in the following snippet:

def compute_forward(self, batch, stage):
    audio, audio_len = batch["signal"]
    # the examples are automatically padded; audio_len contains the relative
    # length of each original sequence.
    return self.modules.model(audio.unsqueeze(1)).mean(-1).unsqueeze(-1)

def compute_objectives(self, logits, batch, stage):
    spk_ids, _ = batch["spkid_encoded"]
    return torch.nn.functional.cross_entropy(logits, spk_ids)

Full Example: Training a simple Speaker Recognition System.

Hereafter we show how the DynamicItemDataset, DIPs and CategoricalEncoder can be used to build a data pipeline for Speaker Recognition.

In particular we have to:

  • read the audio

  • read the speaker ID from annotation and encode it to integer

We first instantiate the dataset from the JSON annotation:

dataset = DynamicItemDataset.from_json("data.json")

Then we fit the CategoricalEncoder to the speaker IDs (spkID) in the annotation:

spk_id_encoder = CategoricalEncoder()
spk_id_encoder.update_from_didataset(dataset, "spkID")

We add the DIP which encodes spkID:

dataset.add_dynamic_item(spk_id_encoder.encode_label_torch, takes="spkID", provides="spk_encoded")

We add the DIP for reading the audio:

dataset.add_dynamic_item(speechbrain.dataio.dataio.read_audio, takes="file_path", provides="signal")

and set the outputs of the dataset we want to access in the training loop:

dataset.set_output_keys(["id", "signal", "spk_encoded"])

We sort the dataset by length to speed up training, so that the amount of padding within each batch is minimized.

sorted_data = dataset.filtered_sorted(sort_key="length")
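The filtered_sorted method can also filter examples, e.g. to discard utterances that are too long before sorting. A sketch, assuming the length annotation is in samples at 16 kHz and using the key_max_value argument of filtered_sorted:

# keep only examples shorter than ~10 seconds, then sort them by length
short_sorted_data = dataset.filtered_sorted(
    key_max_value={"length": 16000 * 10}, sort_key="length"
)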

We can now train a simple classifier by passing the dataset object directly to the Brain class. The Brain class will automatically create a SaveableDataLoader with the specified train_loader_kwargs and will handle the padding for you.

from speechbrain.lobes.features import Fbank
from speechbrain.nnet.losses import nll_loss


class SimpleBrain(speechbrain.Brain):
    def compute_forward(self, batch, stage):
        x = self.modules.features(batch.signal.data)
        x = self.modules.encoder(x)
        x = self.modules.pooling(x, batch.signal.lengths)
        x = self.modules.to_output(x)
        return self.modules.softmax(x)

    def compute_objectives(self, logits, batch, stage):
        return nll_loss(logits, batch.spk_encoded.data)


modules = {"features": Fbank(left_frames=1, right_frames=1),
           "encoder": torch.nn.Sequential(torch.nn.Linear(40, 256),
                                          torch.nn.ReLU()),
           "pooling": speechbrain.nnet.pooling.StatisticsPooling(),
           "to_output": torch.nn.Linear(512, len(spk_id_encoder)),
           "softmax": speechbrain.nnet.activations.Softmax(apply_log=True)}
brain = SimpleBrain(modules, opt_class=lambda x: torch.optim.SGD(x, 0.01))
brain.fit(range(1), train_set=sorted_data,
          train_loader_kwargs={"batch_size": 16, "drop_last": True})

Acknowledgements

  • Many thanks to Nasser Benabderrazik (lenassero), who helped improve this tutorial.

Citing SpeechBrain

If you use SpeechBrain in your research or business, please cite it using the following BibTeX entry:

@misc{speechbrainV1,
  title={Open-Source Conversational AI with {SpeechBrain} 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}