speechbrain.lobes.models.wav2vec module

Components necessary to build a wav2vec 2.0 architecture following the original paper: https://arxiv.org/abs/2006.11477.

Authors
  • Rudolf A Braun 2022
  • Guillermo Cambara 2022
  • Titouan Parcollet 2022

Summary

Classes:

EncoderWrapper

A wrapper that adds positional information, masks the input and then runs the latent encoder.

W2VLatentExtractor

Convolution based feature extractor from raw audio.

W2VTargetQuantiser

Wraps nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.

Functions:

compute_mask

This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of its entries set to True.

sample_negatives

Samples negatives from target tensor y.

w2v_mask_collate_fn

This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder.

Reference

class speechbrain.lobes.models.wav2vec.W2VLatentExtractor(out_channels=[512, 512, 512, 512, 512, 512, 512], kernel_sizes=[11, 3, 3, 3, 3, 3, 3], strides=[5, 2, 2, 2, 2, 2, 2], dropout=0.0, conv_init='kaiming')[source]

Bases: Module

Convolution based feature extractor from raw audio. The increasing channel numbers follow https://arxiv.org/abs/2109.06870.

Parameters:
  • out_channels (list of ints) – Out channels of convolutional layers.

  • kernel_sizes (list of ints) – Kernels of convolutional layers.

  • strides (list of ints) – Strides of convolutional layers.

  • dropout (float) – Dropout of CNN.

  • conv_init (str) – Initialization scheme for the convolutional layers (default: 'kaiming').

Example

>>> extractor = W2VLatentExtractor()
>>> inputs = torch.rand(10, 5000)
>>> outputs = extractor(inputs)
>>> outputs.shape
torch.Size([10, 14, 512])
forward(x, normalize_signal=True)[source]

Calculates latents from audio input.

get_output_lengths(input_lengths: LongTensor)[source]

Calculates output lengths for given input lengths.
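
A minimal sketch, assuming the default strides; a 5000-sample input reducing to 14 frames is consistent with the class example above:

>>> extractor = W2VLatentExtractor()
>>> extractor.get_output_lengths(torch.LongTensor([5000]))
tensor([14])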

training: bool
class speechbrain.lobes.models.wav2vec.W2VTargetQuantiser(in_dim=512, out_dim=256, quantiser=<class 'speechbrain.nnet.quantisers.GumbelVectorQuantizer'>, num_vars=320, temperature_decay=(2.0, 0.25, 0.999995))[source]

Bases: Module

Wraps nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.

Example

>>> quantiser = W2VTargetQuantiser()
>>> inputs = torch.rand(10, 12, 512)
>>> output, meta = quantiser(inputs)
>>> output.shape
torch.Size([10, 12, 256])
forward(x)[source]

Returns quantised targets plus meta information.

training: bool
class speechbrain.lobes.models.wav2vec.EncoderWrapper(in_dim, embedding_dim, latent_encoder, positional_encoding=<class 'speechbrain.lobes.models.transformer.Transformer.PositionalEncoding'>, dropout_encoder_input=0.05)[source]

Bases: Module

A wrapper that adds positional information, masks the input and then runs the latent encoder.

Parameters:
  • in_dim (int) – Last dimension of input tensor.

  • embedding_dim (int) – Dimension to project input to and that the latent encoder will use.

  • latent_encoder (torch.nn.Module) – Initialized latent encoder object.

  • positional_encoding (torch.nn.Module) – Uninitialized nn.Module for adding positional information; will use embedding_dim.

  • dropout_encoder_input (float) – Dropout on encoder input.

Example

>>> from speechbrain.lobes.models.transformer.Transformer import TransformerEncoder
>>> encoder = TransformerEncoder(d_model=768, num_layers=4, nhead=4, d_ffn=1024)
>>> wrapper = EncoderWrapper(1024, 768, encoder)
>>> inputs = torch.rand(10, 12, 1024)
>>> outputs = wrapper(inputs)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])
forward(latents, wav_lens=None, padding_mask=None, mask=None)[source]
Parameters:
  • latents (torch.Tensor, shape (B, T, C)) – Batch of latent representations (frames) output by the latent extractor.

  • wav_lens (torch.Tensor, shape (B,)) – The actual (unpadded) relative lengths for each sample of the batch (0 < wav_lens < 1).

  • padding_mask (torch.Tensor, shape (B, T)) – Can be provided instead of wav_lens.

  • mask (torch.Tensor, shape (B, T)) – Boolean mask which decides which latent frames will be masked.
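
A hedged sketch of a masked forward pass, reusing wrapper and inputs from the Example above. The hand-built contiguous mask is an illustrative assumption (training recipes derive it via compute_mask), and the output is assumed to keep the unmasked shape:

>>> mask = torch.zeros(10, 12, dtype=torch.bool)
>>> mask[:, 3:6] = True  # mask three contiguous frames in every sample
>>> outputs = wrapper(inputs, mask=mask)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])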

training: bool
speechbrain.lobes.models.wav2vec.compute_mask(shape, sample_lens, mask_prob, mask_length)[source]

This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of its entries set to True.

Parameters:
  • shape (list of ints, like (N, M)) – Shape of boolean mask to return.

  • sample_lens (list of ints) – Absolute length of each sample.

  • mask_prob (float) – Approximate percentage of entries to mask.

  • mask_length (int) – Length of contiguous subsequence to mask.

Returns:

mask – Boolean mask with the shape given by the shape argument.

Return type:

numpy.ndarray
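
A hedged usage sketch; the positions of the True entries are random, but the output shape matches the shape argument:

>>> mask = compute_mask((4, 100), [100, 90, 80, 70], mask_prob=0.5, mask_length=10)
>>> mask.shape
(4, 100)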

speechbrain.lobes.models.wav2vec.sample_negatives(y, num_neg)[source]

Samples negatives from target tensor y.

Parameters:
  • y (torch.Tensor) – Tensor of shape (B, T, C).

  • num_neg (int) – Number of negatives to sample.

Returns:

negs – Negatives in shape (N, B, T, C)

Return type:

torch.Tensor
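
A minimal sketch following the documented (N, B, T, C) return shape:

>>> y = torch.rand(3, 10, 256)
>>> negs = sample_negatives(y, num_neg=4)
>>> negs.shape
torch.Size([4, 3, 10, 256])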

speechbrain.lobes.models.wav2vec.w2v_mask_collate_fn(samples_lst, get_out_len_fn, mask_prob, mask_length)[source]

This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder. Creating the mask requires knowing the output shape after the latent extractor, hence the get_out_len_fn argument. One could instead create a mask per sample when loading each audio file and collate the masks later, but at load time the length of the shortest sample in the batch (which determines the number of masked frames) is unknown, so it is better done here.

Parameters:
  • samples_lst (list) – List of samples returned by the audio_pipeline.

  • get_out_len_fn (function) – Function that calculates the length of a sample after it passes through the feature extractor.

  • mask_prob (float) – Approximate percentage of frames to mask.

  • mask_length (int) – Number of contiguous frames that will be masked.

Returns:

  • wavs_padded (torch.Tensor, shape (B, T)) – Audio arrays with right-sided padding.

  • wav_lens (torch.Tensor, shape (B,)) – For each sample, the fraction of the array that is not padding.

  • mask (torch.Tensor, shape (B, T)) – Boolean mask to mask frames.
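
A hypothetical wiring sketch. The dict structure of each sample (the "id" and "sig" keys) is an assumption about what the audio_pipeline produces, not confirmed by this page; get_output_lengths comes from W2VLatentExtractor above:

>>> extractor = W2VLatentExtractor()
>>> samples = [
...     {"id": "utt1", "sig": torch.rand(4000)},  # hypothetical pipeline output
...     {"id": "utt2", "sig": torch.rand(5000)},
... ]
>>> wavs, wav_lens, mask = w2v_mask_collate_fn(
...     samples, extractor.get_output_lengths, mask_prob=0.65, mask_length=2
... )
>>> wavs.shape
torch.Size([2, 5000])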