speechbrain.inference.classifiers module

Specifies the inference interfaces for Audio Classification modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

AudioClassifier

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

EncoderClassifier

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

Reference

class speechbrain.inference.classifiers.EncoderClassifier(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

The class assumes that an encoder called "embedding_model" and a model called "classifier" are defined in the yaml file. If you want to convert the predicted index into a corresponding text label, provide the path of the label_encoder in a variable called "lab_encoder_file" within the yaml.

The class can be used either to run only the encoder (encode_batch()) to extract embeddings or to run a classification step (classify_batch()).

Example

>>> import torchaudio
>>> from speechbrain.inference.classifiers import EncoderClassifier
>>> # Model is downloaded from the speechbrain HuggingFace repo
>>> tmpdir = getfixture("tmpdir")
>>> classifier = EncoderClassifier.from_hparams(
...     source="speechbrain/spkrec-ecapa-voxceleb",
...     savedir=tmpdir,
... )
>>> classifier.hparams.label_encoder.ignore_len()
>>> # Compute embeddings
>>> signal, fs = torchaudio.load("tests/samples/single-mic/example1.wav")
>>> embeddings = classifier.encode_batch(signal)
>>> # Classification
>>> prediction = classifier.classify_batch(signal)
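The model expects 16 kHz audio. If a file is at a different sample rate, it can be resampled first; a minimal sketch using torchaudio (the file name below is illustrative):

>>> signal, fs = torchaudio.load("my_audio.wav")  # hypothetical input file
>>> if fs != 16000:
...     signal = torchaudio.functional.resample(signal, orig_freq=fs, new_freq=16000)
>>> embeddings = classifier.encode_batch(signal)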
MODULES_NEEDED = ['compute_features', 'mean_var_norm', 'embedding_model', 'classifier']
encode_batch(wavs, wav_lens=None, normalize=False)[source]

Encodes the input audio into a single vector embedding.

The waveforms should already be in the model's expected format. In most cases, calling normalized = <this>.normalizer(signal, sample_rate) returns a correctly converted signal.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model. Make sure the sample rate is fs=16000 Hz.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

  • normalize (bool) – If True, it normalizes the embeddings with the statistics contained in mean_var_norm_emb.

Returns:

The encoded batch

Return type:

torch.Tensor
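For batches that mix utterances of different lengths, zero-pad to the longest waveform and pass the relative lengths in wav_lens so the padding is ignored. A minimal sketch, reusing the classifier loaded in the example above:

>>> import torch
>>> wav_a = torch.randn(16000)  # 1.0 s at 16 kHz
>>> wav_b = torch.randn(8000)   # 0.5 s at 16 kHz
>>> batch = torch.zeros(2, 16000)  # zero-pad to the longest waveform
>>> batch[0] = wav_a
>>> batch[1, :8000] = wav_b
>>> wav_lens = torch.tensor([1.0, 0.5])  # relative lengths
>>> embeddings = classifier.encode_batch(batch, wav_lens=wav_lens)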

classify_batch(wavs, wav_lens=None)[source]

Performs classification on top of the encoded features.

It returns the posterior probabilities, the index of the best class, and, if a label encoder is specified, the corresponding text label.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model. Make sure the sample rate is fs=16000 Hz.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
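A short sketch of unpacking the return values, assuming the classifier and signal from the class example above:

>>> out_prob, score, index, text_lab = classifier.classify_batch(signal)
>>> probabilities = out_prob.exp()  # log-posteriors back to probabilities
>>> best_label = text_lab[0]  # text label of the top class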

classify_file(path, **kwargs)[source]

Classifies the given audio file into the given set of labels.

Parameters:
  • path (str) – Path to the audio file to classify.

  • **kwargs – Arguments forwarded to load_audio.

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
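Usage mirrors classify_batch(), except that the audio is loaded from disk. A sketch using the sample file from the class example:

>>> out_prob, score, index, text_lab = classifier.classify_file(
...     "tests/samples/single-mic/example1.wav"
... )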

forward(wavs, wav_lens=None)[source]

Runs the classification.
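In the current implementation, forward() simply delegates to classify_batch(), so calling the model instance directly yields the same four outputs:

>>> out_prob, score, index, text_lab = classifier(signal)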

training: bool
class speechbrain.inference.classifiers.AudioClassifier(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

The class assumes that an encoder called "embedding_model" and a model called "classifier" are defined in the yaml file. If you want to convert the predicted index into a corresponding text label, provide the path of the label_encoder in a variable called "lab_encoder_file" within the yaml.

The class can be used either to run only the encoder (encode_batch()) to extract embeddings or to run a classification step (classify_batch()).

Example

>>> import torch
>>> from speechbrain.inference.classifiers import AudioClassifier
>>> tmpdir = getfixture("tmpdir")
>>> classifier = AudioClassifier.from_hparams(
...     source="speechbrain/cnn14-esc50",
...     savedir=tmpdir,
... )
>>> signal = torch.randn(1, 16000)
>>> prediction, _, _, text_lab = classifier.classify_batch(signal)
>>> print(prediction.shape)
torch.Size([1, 1, 50])
classify_batch(wavs, wav_lens=None)[source]

Performs classification on top of the encoded features.

It returns the posterior probabilities, the index of the best class, and, if a label encoder is specified, the corresponding text label.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model. Make sure the sample rate is fs=16000 Hz.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
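A short sketch of unpacking the outputs for the ESC-50 model loaded in the class example (its first return value has shape [1, 1, 50], as shown there):

>>> out_prob, score, index, text_lab = classifier.classify_batch(signal)
>>> print(out_prob.shape)
torch.Size([1, 1, 50])
>>> best_label = text_lab[0]  # text label of the top class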

classify_file(path, savedir='audio_cache')[source]

Classifies the given audio file into the given set of labels.

Parameters:
  • path (str) – Path to the audio file to classify.

  • savedir (str) – Directory where the audio file is cached when it is fetched from a remote source. Defaults to "audio_cache".

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
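A minimal sketch; the path below is illustrative, and savedir controls where a remotely fetched file would be cached:

>>> out_prob, score, index, text_lab = classifier.classify_file(
...     "path/to/sound.wav",  # hypothetical audio file
...     savedir="audio_cache",
... )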

forward(wavs, wav_lens=None)[source]

Runs the classification.

training: bool