speechbrain.inference.encoders module

Specifies the inference interfaces for speech and audio encoders.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

MelSpectrogramEncoder

A MelSpectrogramEncoder class created for Zero-Shot Multi-Speaker TTS models.

WaveformEncoder

A ready-to-use waveform encoder model.

Reference

class speechbrain.inference.encoders.WaveformEncoder(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use waveform encoder model.

It can wrap different embedding models, such as SSL models (e.g., wav2vec 2.0) or speaker embedding models (e.g., Xvector). Two functions are available: encode_batch and encode_file. They obtain embeddings from a batch of audio tensors or directly from an audio file, respectively.

The given YAML must contain the fields specified in the *_NEEDED[] lists.

Example

>>> from speechbrain.inference.encoders import WaveformEncoder
>>> tmpdir = getfixture("tmpdir")
>>> ssl_model = WaveformEncoder.from_hparams(
...     source="speechbrain/ssl-wav2vec2-base-libri",
...     savedir=tmpdir,
... ) 
>>> ssl_model.encode_file("samples/audio_samples/example_fr.wav") 
MODULES_NEEDED = ['encoder']
encode_file(path, **kwargs)[source]

Encode the given audio file into a sequence of embeddings.

Parameters:

path (str) – Path to audio file which to encode.

Returns:

The audio file embeddings produced by this system.

Return type:

torch.Tensor

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states.

The waveforms should already be in the model’s desired format.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.Tensor
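As a hedged illustration of the wav_lens convention described above, the sketch below pads a batch and computes the relative lengths. Plain Python lists stand in for torch tensors, and all names are for illustration only:

```python
# Three waveforms of different lengths (lists stand in for 1-D tensors).
signals = [[0.1] * 16000, [0.2] * 8000, [0.3] * 12000]
max_len = max(len(s) for s in signals)

# Pad every waveform to the longest one with zeros, as a batch would be.
padded = [s + [0.0] * (max_len - len(s)) for s in signals]

# Relative lengths: the longest waveform gets 1.0,
# the others get len(waveform) / max_length.
wav_lens = [len(s) / max_len for s in signals]
print(wav_lens)  # [1.0, 0.5, 0.75]
```

The padded batch and wav_lens would then be passed together to encode_batch, which uses the relative lengths to ignore the zero padding.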

forward(wavs, wav_lens)[source]

Runs the encoder

training: bool
class speechbrain.inference.encoders.MelSpectrogramEncoder(*args, **kwargs)[source]

Bases: Pretrained

A MelSpectrogramEncoder class created for Zero-Shot Multi-Speaker TTS models.

This is for speaker encoder models using the PyTorch MelSpectrogram transform for compatibility with the current TTS pipeline.

This class can be used to encode a single waveform, a single mel-spectrogram, or a batch of mel-spectrograms.

Example

>>> import torchaudio
>>> from speechbrain.inference.encoders import MelSpectrogramEncoder
>>> # Model is downloaded from the speechbrain HuggingFace repo
>>> tmpdir = getfixture("tmpdir")
>>> encoder = MelSpectrogramEncoder.from_hparams(
...     source="speechbrain/tts-ecapa-voxceleb",
...     savedir=tmpdir,
... ) 
>>> # Compute embedding from a waveform (sample_rate must match the sample rate of the encoder)
>>> signal, fs = torchaudio.load("tests/samples/single-mic/example1.wav") 
>>> spk_emb = encoder.encode_waveform(signal) 
>>> # Compute embedding from a mel-spectrogram (sample_rate must match the sample rate of the encoder)
>>> mel_spec = encoder.mel_spectogram(audio=signal) 
>>> spk_emb = encoder.encode_mel_spectrogram(mel_spec) 
>>> # Compute embeddings for a batch of mel-spectrograms
>>> spk_embs = encoder.encode_mel_spectrogram_batch(mel_spec) 
MODULES_NEEDED = ['normalizer', 'embedding_model']
dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals.
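Log-based dynamic range compression of this kind is commonly implemented as log(clamp(x, min=clip_val) * C). A minimal pure-Python sketch of that formulation follows (an assumption about the exact formula; the actual method operates on torch tensors):

```python
import math

def dynamic_range_compression(x, C=1, clip_val=1e-05):
    # Clamp each value to at least clip_val (avoiding log(0)),
    # scale by C, then take the natural log.
    return [math.log(max(v, clip_val) * C) for v in x]

mel_row = [0.0, 0.5, 1.0]
compressed = dynamic_range_compression(mel_row)
# compressed[0] is log(1e-05) (the zero was clamped); compressed[2] is log(1.0) == 0.0
```

The clip_val floor keeps silent frames from producing -inf, which would otherwise destabilize downstream training.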

mel_spectogram(audio)[source]

Calculates a mel-spectrogram for a raw audio signal.

Parameters:

audio (torch.Tensor) – input audio signal

Returns:

mel – Mel-spectrogram

Return type:

torch.Tensor

encode_waveform(wav)[source]

Encodes a single waveform.

Parameters:

wav (torch.Tensor) – waveform

Returns:

encoder_out – Speaker embedding for the input waveform

Return type:

torch.Tensor

encode_mel_spectrogram(mel_spec)[source]

Encodes a single mel-spectrogram.

Parameters:

mel_spec (torch.Tensor) – Mel-spectrogram to encode.

Returns:

encoder_out – Speaker embedding for the input mel-spectrogram

Return type:

torch.Tensor

encode_mel_spectrogram_batch(mel_specs, lens=None)[source]

Encodes a batch of mel-spectrograms.

Parameters:
  • mel_specs (torch.Tensor) – Batch of mel-spectrograms to encode.

  • lens (torch.Tensor) – Relative lengths of the mel-spectrograms in the batch; the longest one has relative length 1.0. Defaults to None.

Returns:

encoder_out – Speaker embedding for the input mel-spectrogram batch

Return type:

torch.Tensor
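The batch method expects mel-spectrograms padded to a common number of frames, with relative lengths analogous to wav_lens in encode_batch. A hedged sketch of that preparation, with nested lists standing in for torch tensors and all names illustrative:

```python
n_mels = 4
# Two mel-spectrograms with different numbers of time frames.
specs = [
    [[0.1] * n_mels for _ in range(10)],  # 10 frames
    [[0.2] * n_mels for _ in range(6)],   # 6 frames
]
max_frames = max(len(s) for s in specs)

# Zero-pad along the time axis so every spectrogram has max_frames frames.
padded = [s + [[0.0] * n_mels] * (max_frames - len(s)) for s in specs]

# Relative lengths: longest spectrogram gets 1.0, others frames / max_frames.
lens = [len(s) / max_frames for s in specs]
print(lens)  # [1.0, 0.6]
```

The stacked padded batch and lens would then be passed to encode_mel_spectrogram_batch so that padding frames are ignored.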

training: bool