speechbrain.lobes.models.huggingface_wav2vec module

This lobe enables the integration of HuggingFace pretrained wav2vec2.0/HuBERT/WavLM models.

Reference: https://arxiv.org/abs/2006.11477
Reference: https://arxiv.org/abs/1904.05862
Reference: https://arxiv.org/abs/2110.13900

The Transformers library from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html

Authors
  • Titouan Parcollet 2021

  • Boumadane Abdelmoumene 2021

Summary

Classes:

HuggingFaceWav2Vec2

This lobe enables the integration of HuggingFace and SpeechBrain pretrained wav2vec2.0/HuBERT models.

HuggingFaceWav2Vec2Pretrain

This lobe enables the integration of HuggingFace wav2vec2.0 models to be pretrained.

WeightedSSLModel

This lobe enables the use of weighted-sum representations from the different layers of an SSL encoder.

Reference

class speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2(source, save_path, output_norm=False, freeze=False, freeze_feature_extractor=False, apply_spec_augment=False, output_all_hiddens=False)[source]

Bases: Module

This lobe enables the integration of HuggingFace and SpeechBrain pretrained wav2vec2.0/HuBERT models.

Source paper wav2vec2.0: https://arxiv.org/abs/2006.11477
Source paper HuBERT: https://arxiv.org/abs/2106.07447
The Transformers library from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html

The model can be used as a fixed feature extractor or can be finetuned. It will automatically download the model from HuggingFace, or use a local path.

Parameters:
  • source (str) – HuggingFace hub name: e.g. “facebook/wav2vec2-large-lv60”

  • save_path (str) – Path (dir) of the downloaded model.

  • output_norm (bool (default: False)) – If True, a layer_norm (affine) will be applied to the output obtained from the wav2vec model.

  • freeze (bool (default: False)) – If True, the model is frozen. If False, the model will be trained along with the rest of the pipeline.

  • freeze_feature_extractor (bool (default: False)) – When freeze = False and freeze_feature_extractor = True, the feature_extractor module of the model is frozen. If False, the whole wav2vec model will be trained, including the feature_extractor module.

  • apply_spec_augment (bool (default: False)) – If True, the model will apply SpecAugment to the output of the feature extractor (inside the HuggingFace Wav2VecModel() class). If False, the model will not apply SpecAugment. We set this to False by default to prevent applying it twice.

  • output_all_hiddens (bool (default: False)) – If True, the forward function outputs the hidden states from all transformer layers. For example, wav2vec2-base has 12 transformer layers and the output is of shape (13, B, T, C), where a projection of the CNN output is added to the beginning. If False, the forward function outputs the hidden states only from the last transformer layer (see the sketch after the example below).

Example

>>> inputs = torch.rand([10, 600])
>>> model_hub = "facebook/wav2vec2-base-960h"
>>> save_path = "savedir"
>>> model = HuggingFaceWav2Vec2(model_hub, save_path)
>>> outputs = model(inputs)
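
As a minimal sketch of the output_all_hiddens option described above (illustrative only; the shape follows the parameter description for a 12-layer base model):

>>> model = HuggingFaceWav2Vec2(model_hub, save_path, output_all_hiddens=True)
>>> all_hiddens = model(inputs)  # shape (13, B, T, C): CNN projection + 12 layers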
forward(wav, wav_lens=None)[source]

Takes an input waveform and returns its corresponding wav2vec encoding.

Parameters:
  • wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative lengths of the wavs given in SpeechBrain format.
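
For illustration, wav_lens holds one relative length per batch element, i.e. the fraction of the longest signal in the batch; a minimal sketch, reusing the model from the example above and assuming the second signal is half the length of the first:

>>> wav = torch.rand([2, 600])
>>> wav_lens = torch.tensor([1.0, 0.5])
>>> features = model(wav, wav_lens)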

extract_features(wav, wav_lens=None)[source]

Takes an input waveform and returns its corresponding wav2vec encoding.

Parameters:
  • wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative lengths of the wavs given in SpeechBrain format.

make_masks(src, wav_len=None, pad_idx=0)[source]

This method generates the padding masks.

Parameters:
  • src (torch.Tensor) – The sequence to the encoder (required).

  • wav_len (torch.Tensor) – The relative length of the wav given in SpeechBrain format.

  • pad_idx (int) – The index for <pad> token (default=0).
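
A minimal sketch of how such a padding mask can be derived from relative lengths (illustrative only, not necessarily the exact implementation):

>>> src = torch.rand([2, 600])
>>> wav_len = torch.tensor([1.0, 0.5])
>>> abs_len = torch.round(wav_len * src.shape[1])  # relative -> absolute lengths
>>> mask = torch.arange(src.shape[1])[None, :] < abs_len[:, None]  # True = real frame
>>> mask.shape
torch.Size([2, 600])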

training: bool
class speechbrain.lobes.models.huggingface_wav2vec.HuggingFaceWav2Vec2Pretrain(source, save_path, mask_prob=0.65, mask_length=10, normalize_wav=True)[source]

Bases: Module

This lobe enables the integration of HuggingFace wav2vec2.0 models to be pretrained.

Source paper: https://arxiv.org/abs/2006.11477
The Transformers library from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html

The return is in HuggingFace format and includes the mask indices: https://huggingface.co/transformers/model_doc/wav2vec2.html#wav2vec2forpretraining

For instance, the pretraining loss can be accessed with .loss (see the sketch after the example below).

Parameters:
  • source (str) – HuggingFace hub name: e.g. “facebook/wav2vec2-large-lv60”

  • save_path (str) – Path (dir) of the downloaded model.

  • mask_prob (float (default: 0.65)) – Probability of masking a given frame. Default is taken from the paper.

  • mask_length (int (default: 10)) – Length (i.e. number of consecutive masked frames). Default is taken from the paper.

Example

>>> inputs = torch.rand([10, 32000])
>>> model_hub = "facebook/wav2vec2-base-960h"
>>> save_path = "savedir"
>>> model = HuggingFaceWav2Vec2Pretrain(model_hub, save_path)
>>> outputs, _ = model(inputs, wav_lens=None)
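
Since the return is the HuggingFace pretraining output together with the mask indices (see above), the loss can be used directly; a short sketch:

>>> outputs, mask_time_indices = model(inputs, wav_lens=None)
>>> loss = outputs.loss  # pretraining loss, accessed with .loss as documented above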
forward(wav, wav_lens=None)[source]

Takes an input waveform and returns the HuggingFace pretraining output together with the mask indices.

Parameters:
  • wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative lengths of the wavs given in SpeechBrain format.

make_padding_masks(src, wav_len=None, pad_idx=0)[source]

This method generates the padding masks.

Parameters:
  • src (torch.Tensor) – The sequence to the encoder (required).

  • wav_len (torch.Tensor) – The relative length of the wav given in SpeechBrain format.

  • pad_idx (int) – The index for <pad> token (default=0).

training: bool
class speechbrain.lobes.models.huggingface_wav2vec.WeightedSSLModel(hub, num_layers, layernorm=False)[source]

Bases: Module

This lobe enables the use of weighted-sum representations from the different layers of an SSL encoder.

The model can be used as a fixed feature extractor for SSL benchmarking. It will automatically download the model from HuggingFace, or use a local path.

More details in recipes/SSL_benchmark

Parameters:
  • hub (str) – HuggingFace hub name: e.g. “facebook/wav2vec2-large-lv60”

  • num_layers (int) – Number of internal layers: e.g. 13 for “Base” models.

  • layernorm (bool) – Whether the layer representations should be layer-normalized before the sum.

Example

>>> inputs = torch.rand([10, 600])
>>> model_hub = "facebook/wav2vec2-base-960h"
>>> num_layers = 13
>>> model = WeightedSSLModel(model_hub, num_layers)
>>> outputs = model(inputs)
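
Conceptually, the output combines the stacked layer representations with learned, softmax-normalized weights. A minimal sketch of that idea, assuming hidden states stacked as (num_layers, B, T, C) with toy values (not the module's actual internals):

>>> import torch
>>> hidden = torch.rand([13, 10, 50, 768])  # (num_layers, B, T, C)
>>> weights = torch.nn.Parameter(torch.zeros(13))  # learned during training
>>> norm_weights = torch.nn.functional.softmax(weights, dim=-1)
>>> weighted = (norm_weights[:, None, None, None] * hidden).sum(dim=0)
>>> weighted.shape
torch.Size([10, 50, 768])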
forward(wav, wav_lens=None)[source]

This method outputs a weighted sum of the layer representations of the SSL encoder.

Parameters:
  • wav (torch.Tensor) – The wavs.

  • wav_lens (torch.Tensor) – The relative lengths of the wavs given in SpeechBrain format.

training: bool