speechbrain.inference.classifiers module

Specifies the inference interfaces for Audio Classification modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

AudioClassifier

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

EncoderClassifier

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

Reference

class speechbrain.inference.classifiers.EncoderClassifier(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

The class assumes that an encoder called "embedding_model" and a model called "classifier" are defined in the yaml file. If you want to convert the predicted index into a corresponding text label, provide the path of the label_encoder in a variable called "lab_encoder_file" within the yaml.

The class can be used either to run only the encoder (encode_batch()) to extract embeddings or to run a classification step (classify_batch()).

Example

>>> import torchaudio
>>> from speechbrain.inference.classifiers import EncoderClassifier
>>> # Model is downloaded from the speechbrain HuggingFace repo
>>> tmpdir = getfixture("tmpdir")
>>> classifier = EncoderClassifier.from_hparams(
...     source="speechbrain/spkrec-ecapa-voxceleb",
...     savedir=tmpdir,
... )
>>> classifier.hparams.label_encoder.ignore_len()
>>> # Compute embeddings
>>> signal, fs = torchaudio.load("tests/samples/single-mic/example1.wav")
>>> embeddings = classifier.encode_batch(signal)
>>> # Classification
>>> prediction = classifier.classify_batch(signal)
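The model expects 16 kHz audio. If a file is at a different sample rate, it can be resampled first; a minimal sketch using torchaudio (the file name below is illustrative):

>>> signal, fs = torchaudio.load("my_audio.wav")  # hypothetical input file
>>> if fs != 16000:
...     signal = torchaudio.functional.resample(signal, orig_freq=fs, new_freq=16000)
>>> embeddings = classifier.encode_batch(signal)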
MODULES_NEEDED = ['compute_features', 'mean_var_norm', 'embedding_model', 'classifier']
encode_batch(wavs, wav_lens=None, normalize=False)[source]

Encodes the input audio into a single vector embedding.

The waveforms should already be in the model's expected format. In most cases, calling normalized = <this>.normalizer(signal, sample_rate) returns a correctly converted signal.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model. Make sure the sample rate is fs=16000 Hz.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

  • normalize (bool) – If True, it normalizes the embeddings with the statistics contained in mean_var_norm_emb.

Returns:

The encoded batch

Return type:

torch.Tensor
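For batches that mix utterances of different lengths, zero-pad to the longest waveform and pass the relative lengths in wav_lens so the padding is ignored. A minimal sketch, reusing the classifier loaded in the example above:

>>> import torch
>>> wav_a = torch.randn(16000)  # 1.0 s at 16 kHz
>>> wav_b = torch.randn(8000)   # 0.5 s at 16 kHz
>>> batch = torch.zeros(2, 16000)  # zero-pad to the longest waveform
>>> batch[0] = wav_a
>>> batch[1, :8000] = wav_b
>>> wav_lens = torch.tensor([1.0, 0.5])  # relative lengths
>>> embeddings = classifier.encode_batch(batch, wav_lens=wav_lens)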

classify_batch(wavs, wav_lens=None)[source]

Performs classification on top of the encoded features.

It returns the posterior probabilities, the index of the best class, and, if a label encoder is specified, the corresponding text label.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model. Make sure the sample rate is fs=16000 Hz.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
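A short sketch of unpacking the return values, assuming the classifier and signal from the class example above:

>>> out_prob, score, index, text_lab = classifier.classify_batch(signal)
>>> probabilities = out_prob.exp()  # log-posteriors back to probabilities
>>> best_label = text_lab[0]  # text label of the top class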

classify_file(path, **kwargs)[source]

Classifies the given audio file into the given set of labels.

Parameters:
  • path (str) – Path to the audio file to classify.

  • **kwargs – Arguments forwarded to load_audio.

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
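Usage mirrors classify_batch(), except that the audio is loaded from disk. A sketch using the sample file from the class example:

>>> out_prob, score, index, text_lab = classifier.classify_file(
...     "tests/samples/single-mic/example1.wav"
... )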

forward(wavs, wav_lens=None)[source]

Runs the classification.
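In the current implementation, forward() simply delegates to classify_batch(), so calling the model instance directly yields the same four outputs:

>>> out_prob, score, index, text_lab = classifier(signal)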

training: bool
class speechbrain.inference.classifiers.AudioClassifier(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use class for utterance-level classification (e.g., speaker-id, language-id, emotion recognition, keyword spotting, etc.).

The class assumes that an encoder called "embedding_model" and a model called "classifier" are defined in the yaml file. If you want to convert the predicted index into a corresponding text label, provide the path of the label_encoder in a variable called "lab_encoder_file" within the yaml.

The class can be used either to run only the encoder (encode_batch()) to extract embeddings or to run a classification step (classify_batch()).

Example

>>> import torch
>>> from speechbrain.inference.classifiers import AudioClassifier
>>> tmpdir = getfixture("tmpdir")
>>> classifier = AudioClassifier.from_hparams(
...     source="speechbrain/cnn14-esc50",
...     savedir=tmpdir,
... )
>>> signal = torch.randn(1, 16000)
>>> prediction, _, _, text_lab = classifier.classify_batch(signal)
>>> print(prediction.shape)
torch.Size([1, 1, 50])
classify_batch(wavs, wav_lens=None)[source]

Performs classification on top of the encoded features.

It returns the posterior probabilities, the index of the best class, and, if a label encoder is specified, the corresponding text label.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model. Make sure the sample rate is fs=16000 Hz.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
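A short sketch of unpacking the outputs for the ESC-50 model loaded in the class example (its first return value has shape [1, 1, 50], as shown there):

>>> out_prob, score, index, text_lab = classifier.classify_batch(signal)
>>> print(out_prob.shape)
torch.Size([1, 1, 50])
>>> best_label = text_lab[0]  # text label of the top class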

classify_file(path, savedir='audio_cache')[source]

Classifies the given audio file into the given set of labels.

Parameters:
  • path (str) – Path to the audio file to classify.

  • savedir (str) – Directory where the audio file is cached when it is fetched from a remote source. Defaults to "audio_cache".

Returns:

  • out_prob – The log posterior probabilities of each class ([batch, N_class])

  • score – The value of the log-posterior for the best class ([batch,])

  • index – The index of the best class ([batch,])

  • text_lab – List with the text labels corresponding to the indexes (a label encoder must be provided).
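A minimal sketch; the path below is illustrative, and savedir controls where a remotely fetched file would be cached:

>>> out_prob, score, index, text_lab = classifier.classify_file(
...     "path/to/sound.wav",  # hypothetical audio file
...     savedir="audio_cache",
... )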

forward(wavs, wav_lens=None)[source]

Runs the classification.

training: bool