speechbrain.lobes.models.wav2vec module
Components necessary to build a wav2vec 2.0 architecture following the original paper: https://arxiv.org/abs/2006.11477.
Authors
* Rudolf A Braun 2022
* Guillermo Cambara 2022
* Titouan Parcollet 2022
Summary
Classes:
EncoderWrapper – A wrapper that adds positional information, masks the input and then runs the latent encoder.
W2VLatentExtractor – Convolution based feature extractor from raw audio.
W2VTargetQuantiser – Wraps nnet.quantisers.GumbelVectorQuantizer.
Functions:
compute_mask – Creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of the entries set to True.
sample_negatives – Samples negatives from target tensor y.
w2v_mask_collate_fn – Creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder.
Reference
- class speechbrain.lobes.models.wav2vec.W2VLatentExtractor(out_channels=[512, 512, 512, 512, 512, 512, 512], kernel_sizes=[11, 3, 3, 3, 3, 3, 3], strides=[5, 2, 2, 2, 2, 2, 2], dropout=0.0, conv_init='kaiming')[source]
Bases: Module
Convolution based feature extractor from raw audio. The increasing channel sizes follow https://arxiv.org/abs/2109.06870.
- Parameters
out_channels (list of int) – Output channels of the convolutional layers.
kernel_sizes (list of int) – Kernel sizes of the convolutional layers.
strides (list of int) – Strides of the convolutional layers.
dropout (float) – Dropout probability.
conv_init (str) – Initialization scheme for the convolutional layers, e.g. 'kaiming'.
Example
>>> extractor = W2VLatentExtractor()
>>> inputs = torch.rand(10, 5000)
>>> outputs = extractor(inputs)
>>> outputs.shape
torch.Size([10, 14, 512])
- class speechbrain.lobes.models.wav2vec.W2VTargetQuantiser(in_dim=512, out_dim=256, quantiser=<class 'speechbrain.nnet.quantisers.GumbelVectorQuantizer'>, num_vars=320, temperature_decay=(2.0, 0.25, 0.999995))[source]
Bases: Module
Wraps speechbrain.nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.
Example
>>> quantiser = W2VTargetQuantiser()
>>> inputs = torch.rand(10, 12, 512)
>>> output, meta = quantiser(inputs)
>>> output.shape
torch.Size([10, 12, 256])
- class speechbrain.lobes.models.wav2vec.EncoderWrapper(in_dim, embedding_dim, latent_encoder, positional_encoding=<class 'speechbrain.lobes.models.transformer.Transformer.PositionalEncoding'>, dropout_encoder_input=0.05)[source]
Bases: Module
A wrapper that adds positional information, masks the input and then runs the latent encoder.
- Parameters
in_dim (int) – Last dimension of input tensor.
embedding_dim (int) – Dimension to project the input to and that the latent encoder will use.
latent_encoder (torch.nn.Module) – Initialized latent encoder object.
positional_encoding (torch.nn.Module) – Uninitialized nn.Module for adding positional information; will use embedding_dim.
dropout_encoder_input (float) – Dropout on encoder input.
Example
>>> from speechbrain.lobes.models.transformer.Transformer import TransformerEncoder
>>> encoder = TransformerEncoder(d_model=768, num_layers=4, nhead=4, d_ffn=1024)
>>> wrapper = EncoderWrapper(1024, 768, encoder)
>>> inputs = torch.rand(10, 12, 1024)
>>> outputs = wrapper(inputs)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])
- forward(latents, wav_lens=None, padding_mask=None, mask=None)[source]
- Parameters
latents (torch.Tensor, shape (B, T, C)) – Batch of latent representations (i.e. frames) output from the latent extractor.
wav_lens (torch.Tensor, shape (B,)) – The relative (unpadded) length of each sample in the batch (0 < wav_lens <= 1).
padding_mask (torch.Tensor, shape (B, T)) – Can be provided instead of wav_lens.
mask (torch.Tensor, shape (B, T)) – Boolean mask which decides which latent frames will be masked.
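A hedged usage sketch (an addition, not from the original docstring): it reuses wrapper and inputs from the class example above and assumes the output keeps the same dict format; the mask selects two contiguous frames in every sample.
>>> mask = torch.zeros(10, 12, dtype=torch.bool)
>>> mask[:, 4:6] = True
>>> wav_lens = torch.ones(10)
>>> outputs = wrapper(inputs, wav_lens=wav_lens, mask=mask)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])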
- speechbrain.lobes.models.wav2vec.compute_mask(shape, sample_lens, mask_prob, mask_length)[source]
This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of the entries set to True.
- Parameters
shape (tuple of int) – Target shape of the mask, e.g. (B, T).
sample_lens (list of int) – Absolute length of each sample in the batch.
mask_prob (float) – Approximate percentage of frames to mask.
mask_length (int) – Number of contiguous frames that will be masked.
- Returns
mask – Boolean mask with the shape of the input argument shape.
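A short usage sketch (an addition, not from the original docs), assuming sample_lens holds absolute frame counts; tuple() is used so the printed shape is independent of the exact array type returned.
>>> from speechbrain.lobes.models.wav2vec import compute_mask
>>> mask = compute_mask((4, 100), sample_lens=[100, 90, 80, 70], mask_prob=0.65, mask_length=10)
>>> tuple(mask.shape)
(4, 100)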
- speechbrain.lobes.models.wav2vec.sample_negatives(y, num_neg)[source]
Samples negatives from target tensor y.
- Parameters
y (torch.Tensor) – Tensor of shape (B, T, C).
num_neg (int) – Number of negatives to sample.
- Returns
negs – Negatives in shape (N, B, T, C)
- Return type
torch.Tensor
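A brief usage sketch (added here, not from the original docstring), with shapes following the (N, B, T, C) convention stated above.
>>> import torch
>>> from speechbrain.lobes.models.wav2vec import sample_negatives
>>> y = torch.rand(10, 12, 256)
>>> negs = sample_negatives(y, num_neg=5)
>>> negs.shape
torch.Size([5, 10, 12, 256])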
- speechbrain.lobes.models.wav2vec.w2v_mask_collate_fn(samples_lst, get_out_len_fn, mask_prob, mask_length)[source]
This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder. To create the mask we need to know the output shape after the latent extractor, hence the get_out_len_fn argument. One could instead create a mask per sample (when loading the audio file) and collate the masks afterwards, but at that point the length of the shortest sample in the batch (which determines the number of masked frames) is not yet known, so the masks are created here.
- Parameters
samples_lst (list) – List of samples returned by the audio_pipeline.
get_out_len_fn (function) – Function that calculates the length of a sample after it passes through the feature extractor.
mask_prob (float) – Approximate percentage of frames to mask.
mask_length (int) – Number of contiguous frames that will be masked.
- Returns
wavs_padded (torch.Tensor, shape (B, T)) – Audio arrays with right-sided padding.
wav_lens (torch.Tensor, shape (B,)) – For each sample, the fraction of the array that is not padding.
mask (torch.Tensor, shape (B, T)) – Boolean mask to mask frames.
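The docs do not spell out get_out_len_fn; below is a hypothetical sketch (an assumption, not the library's API) that matches the default W2VLatentExtractor kernel sizes and strides via the standard no-padding convolution length formula. It maps 5000 input samples to 14 frames, consistent with the extractor example above.
>>> import torch
>>> def get_out_len_fn(input_length):
...     # floor((length - kernel) / stride) + 1 for each conv stage (no padding)
...     for k, s in zip([11, 3, 3, 3, 3, 3, 3], [5, 2, 2, 2, 2, 2, 2]):
...         input_length = torch.div(input_length - k, s, rounding_mode="floor") + 1
...     return input_length
>>> get_out_len_fn(torch.tensor(5000))
tensor(14)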