speechbrain.lobes.models.fairseq_wav2vec module

This lobe enables the integration of fairseq pretrained wav2vec models.

Reference: https://arxiv.org/abs/2006.11477
Reference: https://arxiv.org/abs/1904.05862
FairSeq >= 1.0.0 needs to be installed: https://fairseq.readthedocs.io/en/latest/

Authors

Titouan Parcollet 2021
Salima Mdhaffar 2021

Summary

Classes:

`FairseqWav2Vec1`	This lobe enables the integration of fairseq pretrained wav2vec1.0 models.
`FairseqWav2Vec2`	This lobe enables the integration of fairseq pretrained wav2vec2.0 models.

Reference

class speechbrain.lobes.models.fairseq_wav2vec.FairseqWav2Vec2(pretrained_path, save_path, input_norm=None, output_norm=False, freeze=False, freeze_feature_extractor=False, pretrain=True, dropout=None, layer_drop=None)[source]

Bases: Module

This lobe enables the integration of fairseq pretrained wav2vec2.0 models.

Source paper: https://arxiv.org/abs/2006.11477
FairSeq >= 0.10.0 needs to be installed: https://fairseq.readthedocs.io/en/latest/

The model can be used as a fixed feature extractor or can be fine-tuned. If a URL is given (e.g. the FairSeq repository on GitHub), the model is downloaded automatically.

Parameters:

pretrained_path (str) – Path of the pretrained wav2vec2 model. It can be a URL or a local path.
save_path (str) – Path and filename of the downloaded model.
input_norm (bool (default: None)) – If True, a layer_norm (affine) will be applied to the input waveform. By default, it is extracted from the checkpoint of the downloaded model in order to match the pretraining conditions. However, if this information is not given in the checkpoint, it has to be given manually.
output_norm (bool (default: False)) – If True, a layer_norm (affine) will be applied to the output obtained from the wav2vec model.
freeze (bool (default: False)) – If True, the model is frozen. If False, the model will be trained alongside the rest of the pipeline.
freeze_feature_extractor (bool (default: False)) – Whether to prevent feature extraction weights from updating.
pretrain (bool (default: True)) – If True, the pretrained weights are loaded from the specified source. If False, a randomly initialized model is instantiated.
dropout (float (default: None)) – If not None, this value (between 0.0 and 1.0) overrides the dropout rates given by fairseq. This is useful if the wav2vec2 model has been trained without dropout and one wants to reactivate it for downstream task fine-tuning (better performance observed).
layer_drop (float (default: None)) – If not None, this value (between 0.0 and 1.0) overrides the layer_drop rate given by fairseq. This is useful if the wav2vec2 model has been trained without layer_drop and one wants to reactivate it for downstream task fine-tuning.
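To make the effect of input_norm more concrete, the following is a minimal sketch of layer normalization applied to a single waveform. It is an illustration in pure Python rather than the code used by this lobe: the helper name `layer_norm_1d` and the `eps` value are assumptions, and the learnable affine parameters are omitted for simplicity.

```python
import math
import statistics


def layer_norm_1d(samples, eps=1e-5):
    """Normalize one waveform to (approximately) zero mean and unit variance.

    Hypothetical helper for illustration only; no affine scale or shift.
    """
    mean = statistics.fmean(samples)
    # Layer normalization uses the biased (population) variance.
    var = statistics.pvariance(samples, mu=mean)
    scale = 1.0 / math.sqrt(var + eps)
    return [(s - mean) * scale for s in samples]


wav = [0.5, -0.25, 0.75, 0.0]
normed = layer_norm_1d(wav)
```

After normalization the signal has a mean close to zero and a variance close to one, which is the kind of input statistics a checkpoint pretrained with normalized audio would expect.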

Example

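>>> import torch
>>> from speechbrain.lobes.models.fairseq_wav2vec import FairseqWav2Vec2
>>> inputs = torch.rand([10, 600])
>>> wav_lens = torch.ones(10)  # relative lengths: no signal is padded
>>> model_url = (
...     "https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt"
... )
>>> save_path = "models_checkpoints/wav2vec2.pt"
>>> model = FairseqWav2Vec2(model_url, save_path)
>>> outputs = model(inputs, wav_lens)
>>> outputs.shape
torch.Size([10, 100, 768])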

forward(wav, wav_lens)[source]

Takes an input waveform and returns its corresponding wav2vec encoding.

Parameters:

wav (torch.Tensor) – A batch of audio signals to transform to features.
wav_lens (torch.Tensor) – The lengths corresponding to the input wavs.

Returns:

wav2vec encoded features.
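The time dimension of the encoded features is much shorter than the input waveform. As a rough sketch of why, the snippet below computes the number of output frames produced by a stack of unpadded 1D convolutions. The (kernel, stride) pairs are an assumption taken from the default fairseq wav2vec 2.0 base configuration, not from this module, and `num_frames` is a hypothetical helper; other checkpoints may use different settings.

```python
# Assumed (kernel, stride) pairs of the convolutional feature extractor
# (default fairseq wav2vec 2.0 base configuration).
CONV_LAYERS = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2


def num_frames(num_samples, conv_layers=CONV_LAYERS):
    """Return the number of frames left after a stack of unpadded convolutions."""
    length = num_samples
    for kernel, stride in conv_layers:
        if length < kernel:
            return 0  # the signal is too short to produce a single frame
        length = (length - kernel) // stride + 1
    return length


frames_per_second = num_frames(16000)  # one second of 16 kHz audio
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, i.e. roughly one frame every 20 ms.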

extract_features(wav, padding_mask=None)[source]: Extracts the wav2vec embeddings

reset_layer(model)[source]: Reinitializes the parameters of the network

remove_pretraining_modules()[source]: Removes unneeded modules. Inspired by the fairseq function of the same name.

make_masks(src, wav_len=None, pad_idx=0)[source]

This method generates the padding masks.

Parameters:

src (tensor) – The sequence to the encoder (required).
wav_len (tensor) – The relative length of the wav given in SpeechBrain format.
pad_idx (int) – The index for <pad> token (default=0).

Returns:

src_key_padding_mask – The mask for removing pad tokens.

Return type:

torch.Tensor
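The relation between relative lengths and the padding mask can be sketched as follows. This is a dependency-free illustration using nested lists instead of tensors, not the method's implementation: the helper name `make_padding_mask` is hypothetical, and it assumes that True marks padded positions, as is conventional for key padding masks.

```python
def make_padding_mask(rel_lens, num_steps):
    """Build a per-utterance padding mask from relative lengths.

    rel_lens: relative lengths in [0, 1] (SpeechBrain format).
    num_steps: number of time steps in the padded batch.
    Assumption: True marks a padded position, False a valid one.
    """
    mask = []
    for rel in rel_lens:
        valid = round(rel * num_steps)  # absolute number of valid steps
        mask.append([t >= valid for t in range(num_steps)])
    return mask


# Two utterances in a batch padded to 4 steps; the second is half as long.
mask = make_padding_mask([1.0, 0.5], num_steps=4)
```

Here the first row contains only False values, while the last two positions of the second row are True and would be ignored by the encoder.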

class speechbrain.lobes.models.fairseq_wav2vec.FairseqWav2Vec1(pretrained_path, save_path, output_norm=True, freeze=True, pretrain=True)[source]

Bases: Module

This lobe enables the integration of fairseq pretrained wav2vec1.0 models.

Parameters:

pretrained_path (str) – Path of the pretrained wav2vec1 model. It can be a URL or a local path.
save_path (str) – Path and filename of the downloaded model.
output_norm (bool (default: True)) – If True, a layer_norm (affine) will be applied to the output obtained from the wav2vec model.
freeze (bool (default: True)) – If True, the model is frozen. If False, the model will be trained alongside the rest of the pipeline.
pretrain (bool (default: True)) – If True, the pretrained weights are loaded from the specified source. If False, a randomly initialized model is instantiated.

Example

>>> inputs = torch.rand([10, 600])
>>> model_url = ""
>>> save_path = "models_checkpoints/wav2vec.pt"
>>> model = FairseqWav2Vec1(model_url, save_path)
>>> outputs = model(inputs)
>>> outputs.shape
torch.Size([10, 100, 512])

forward(wav)[source]

Takes an input waveform and returns its corresponding wav2vec encoding.

Parameters:

wav (torch.Tensor) – A batch of audio signals to transform to features.

Returns:

wav2vec encoded features.

extract_features(wav)[source]: Extracts the wav2vec embeddings

reset_layer(model)[source]: Reinitializes the parameters of the network