speechbrain.lobes.models.wav2vec module

Components necessary to build a wav2vec 2.0 architecture following the original paper: https://arxiv.org/abs/2006.11477.

Authors
  • Rudolf A Braun 2022
  • Guillermo Cambara 2022
  • Titouan Parcollet 2022

Summary

Classes:

EncoderWrapper

A wrapper that adds positional information, masks the input and then runs the latent encoder.

W2VLatentExtractor

Convolution based feature extractor from raw audio.

W2VTargetQuantiser

Wraps nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.

Functions:

compute_mask

This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of its entries set to True.

sample_negatives

Samples negatives from target tensor y.

w2v_mask_collate_fn

This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder.

Reference

class speechbrain.lobes.models.wav2vec.W2VLatentExtractor(out_channels=[512, 512, 512, 512, 512, 512, 512], kernel_sizes=[11, 3, 3, 3, 3, 3, 3], strides=[5, 2, 2, 2, 2, 2, 2], dropout=0.0, conv_init='kaiming')[source]

Bases: Module

Convolution based feature extractor from raw audio. The increasing channel numbers follow https://arxiv.org/abs/2109.06870.

Parameters:
  • out_channels (list of ints) – Out channels of convolutional layers.

  • kernel_sizes (list of ints) – Kernels of convolutional layers.

  • strides (list of ints) – Strides of convolutional layers.

  • dropout (float) – Dropout of CNN.

  • conv_init (str) – Initialization scheme for the convolutional layers (default: 'kaiming').

Example

>>> extractor = W2VLatentExtractor()
>>> inputs = torch.rand(10, 5000)
>>> outputs = extractor(inputs)
>>> outputs.shape
torch.Size([10, 14, 512])
forward(x, normalize_signal=True)[source]

Calculates latents from audio input.

get_output_lengths(input_lengths: LongTensor)[source]

Calculates output lengths for given input lengths.
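
A minimal sketch, assuming the default strides; a 5000-sample input reducing to 14 frames is consistent with the class example above:

>>> extractor = W2VLatentExtractor()
>>> extractor.get_output_lengths(torch.LongTensor([5000]))
tensor([14])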

training: bool
class speechbrain.lobes.models.wav2vec.W2VTargetQuantiser(in_dim=512, out_dim=256, quantiser=<class 'speechbrain.nnet.quantisers.GumbelVectorQuantizer'>, num_vars=320, temperature_decay=(2.0, 0.25, 0.999995))[source]

Bases: Module

Wraps nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.

Example

>>> quantiser = W2VTargetQuantiser()
>>> inputs = torch.rand(10, 12, 512)
>>> output, meta = quantiser(inputs)
>>> output.shape
torch.Size([10, 12, 256])
forward(x)[source]

Returns quantised targets plus meta information.

training: bool
class speechbrain.lobes.models.wav2vec.EncoderWrapper(in_dim, embedding_dim, latent_encoder, positional_encoding=<class 'speechbrain.lobes.models.transformer.Transformer.PositionalEncoding'>, dropout_encoder_input=0.05)[source]

Bases: Module

A wrapper that adds positional information, masks the input and then runs the latent encoder.

Parameters:
  • in_dim (int) – Last dimension of input tensor.

  • embedding_dim (int) – Dimension to project input to and that the latent encoder will use.

  • latent_encoder (torch.nn.Module) – Initialized latent encoder object.

  • positional_encoding (torch.nn.Module) – Uninitialized nn.Module for adding positional information; will use embedding_dim.

  • dropout_encoder_input (float) – Dropout on encoder input.

Example

>>> from speechbrain.lobes.models.transformer.Transformer import TransformerEncoder
>>> encoder = TransformerEncoder(d_model=768, num_layers=4, nhead=4, d_ffn=1024)
>>> wrapper = EncoderWrapper(1024, 768, encoder)
>>> inputs = torch.rand(10, 12, 1024)
>>> outputs = wrapper(inputs)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])
forward(latents, wav_lens=None, padding_mask=None, mask=None)[source]
Parameters:
  • latents (torch.Tensor, shape (B, T, C)) – Batch of latent representations (frames) output by the latent extractor.

  • wav_lens (torch.Tensor, shape (B,)) – The actual (unpadded) relative lengths for each sample of the batch (0 < wav_lens < 1).

  • padding_mask (torch.Tensor, shape (B, T)) – Can be provided instead of wav_lens.

  • mask (torch.Tensor, shape (B, T)) – Boolean mask which decides which latent frames will be masked.
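
A hedged sketch of a masked forward pass, reusing wrapper and inputs from the Example above. The hand-built contiguous mask is an illustrative assumption (training recipes derive it via compute_mask), and the output is assumed to keep the unmasked shape:

>>> mask = torch.zeros(10, 12, dtype=torch.bool)
>>> mask[:, 3:6] = True  # mask three contiguous frames in every sample
>>> outputs = wrapper(inputs, mask=mask)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])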

training: bool
speechbrain.lobes.models.wav2vec.compute_mask(shape, sample_lens, mask_prob, mask_length)[source]

This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of its entries set to True.

Parameters:
  • shape (list of ints, like (N, M)) – Shape of boolean mask to return.

  • sample_lens (list of ints) – Absolute length of each sample.

  • mask_prob (float) – Approximate percentage of entries to mask.

  • mask_length (int) – Length of contiguous subsequence to mask.

Returns:

mask – Boolean mask with the shape given by the shape argument.

Return type:

numpy.ndarray
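
A hedged usage sketch; the positions of the True entries are random, but the output shape matches the shape argument:

>>> mask = compute_mask((4, 100), [100, 90, 80, 70], mask_prob=0.5, mask_length=10)
>>> mask.shape
(4, 100)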

speechbrain.lobes.models.wav2vec.sample_negatives(y, num_neg)[source]

Samples negatives from target tensor y.

Parameters:
  • y (torch.Tensor) – Tensor of shape (B, T, C).

  • num_neg (int) – Number of negatives to sample.

Returns:

negs – Negatives in shape (N, B, T, C)

Return type:

torch.Tensor
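
A minimal sketch following the documented (N, B, T, C) return shape:

>>> y = torch.rand(3, 10, 256)
>>> negs = sample_negatives(y, num_neg=4)
>>> negs.shape
torch.Size([4, 3, 10, 256])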

speechbrain.lobes.models.wav2vec.w2v_mask_collate_fn(samples_lst, get_out_len_fn, mask_prob, mask_length)[source]

This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder. Creating the mask requires knowing the output shape after the latent extractor, hence the get_out_len_fn argument. One could instead create a mask per sample when loading each audio file and collate the masks later, but at load time the length of the shortest sample in the batch (which determines the number of masked frames) is unknown, so it is better done here.

Parameters:
  • samples_lst (list) – List of samples returned by the audio_pipeline.

  • get_out_len_fn (function) – Function that calculates the length of a sample after it passes through the feature extractor.

  • mask_prob (float) – Approximate percentage of frames to mask.

  • mask_length (int) – Number of contiguous frames that will be masked.

Returns:

  • wavs_padded (torch.Tensor, shape (B, T)) – Audio arrays with right-sided padding.

  • wav_lens (torch.Tensor, shape (B,)) – For each sample, the fraction of the array that is not padding.

  • mask (torch.Tensor, shape (B, T)) – Boolean mask to mask frames.
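
A hypothetical wiring sketch. The dict structure of each sample (the "id" and "sig" keys) is an assumption about what the audio_pipeline produces, not confirmed by this page; get_output_lengths comes from W2VLatentExtractor above:

>>> extractor = W2VLatentExtractor()
>>> samples = [
...     {"id": "utt1", "sig": torch.rand(4000)},  # hypothetical pipeline output
...     {"id": "utt2", "sig": torch.rand(5000)},
... ]
>>> wavs, wav_lens, mask = w2v_mask_collate_fn(
...     samples, extractor.get_output_lengths, mask_prob=0.65, mask_length=2
... )
>>> wavs.shape
torch.Size([2, 5000])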