speechbrain.lobes.models.wav2vec module
Components necessary to build a wav2vec 2.0 architecture following the original paper: https://arxiv.org/abs/2006.11477.
Authors
* Rudolf A Braun 2022
* Guillermo Cambara 2022
* Titouan Parcollet 2022
Summary
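Classes:
EncoderWrapper | A wrapper that adds positional information, masks the input and then runs the latent encoder.
W2VLatentExtractor | Convolution-based feature extractor from raw audio.
W2VTargetQuantiser | Wraps nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.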
Functions:
compute_mask | This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of its entries set to True.
sample_negatives | Samples negatives from target tensor y.
w2v_mask_collate_fn | This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder.
Reference
- class speechbrain.lobes.models.wav2vec.W2VLatentExtractor(out_channels=[512, 512, 512, 512, 512, 512, 512], kernel_sizes=[11, 3, 3, 3, 3, 3, 3], strides=[5, 2, 2, 2, 2, 2, 2], dropout=0.0, conv_init='kaiming')
Bases: Module
Convolution-based feature extractor from raw audio. The use of increasing channel numbers is based on https://arxiv.org/abs/2109.06870
- Parameters:
out_channels (list of ints) – Out channels of convolutional layers.
kernel_sizes (list of ints) – Kernels of convolutional layers.
strides (list of ints) – Strides of convolutional layers.
dropout (float) – Dropout of CNN.
conv_init (str) – Type of initialization to use, default "kaiming".
Example
>>> import torch
>>> from speechbrain.lobes.models.wav2vec import W2VLatentExtractor
>>> extractor = W2VLatentExtractor()
>>> inputs = torch.rand(10, 5000)
>>> outputs = extractor(inputs)
>>> outputs.shape
torch.Size([10, 14, 512])
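The time dimension of 14 in the example can be checked by hand. The following is a minimal sketch, assuming each convolution uses no padding and a dilation of 1 (the padding mode is not stated here, so this is an assumption). `conv_out_len` is a hypothetical helper written for illustration, not part of SpeechBrain.

```python
def conv_out_len(num_samples, kernel_sizes, strides):
    """Number of frames left after a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernel_sizes, strides):
        # Standard output-length formula for a convolution without padding.
        length = (length - kernel) // stride + 1
    return length


# Default values taken from the W2VLatentExtractor signature.
KERNEL_SIZES = [11, 3, 3, 3, 3, 3, 3]
STRIDES = [5, 2, 2, 2, 2, 2, 2]

frames = conv_out_len(5000, KERNEL_SIZES, STRIDES)
print(frames)  # 14, matching the time dimension of the example output
```

A length function of this kind is the sort of callable that `w2v_mask_collate_fn` takes as `get_out_len_fn`.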
- class speechbrain.lobes.models.wav2vec.W2VTargetQuantiser(in_dim=512, out_dim=256, quantiser=<class 'speechbrain.nnet.quantisers.GumbelVectorQuantizer'>, num_vars=320, temperature_decay=(2.0, 0.25, 0.999995))
Bases: Module
Wraps nnet.quantisers.GumbelVectorQuantizer; see that class for documentation on the arguments.
- Parameters:
in_dim (int) – Input dimension (channels).
out_dim (int) – Output dimension.
quantiser (class) – Default GumbelVectorQuantizer.
num_vars (int) – Number of quantized vectors per group.
temperature_decay (tuple) – Temperature for training. This should be a tuple of 3 elements: (start, stop, decay factor).
Example
>>> import torch
>>> from speechbrain.lobes.models.wav2vec import W2VTargetQuantiser
>>> quantiser = W2VTargetQuantiser()
>>> inputs = torch.rand(10, 12, 512)
>>> output, meta = quantiser(inputs)
>>> output.shape
torch.Size([10, 12, 256])
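To illustrate how a (start, stop, decay factor) tuple such as `temperature_decay` can drive a Gumbel-softmax temperature, here is a minimal sketch. It assumes a multiplicative decay per update step that is clamped at the stop value. That schedule is an assumption for illustration and `gumbel_temperature` is a hypothetical helper, so check the quantiser's own documentation for the exact rule.

```python
def gumbel_temperature(step, temperature_decay=(2.0, 0.25, 0.999995)):
    """Temperature after `step` updates under an assumed exponential decay."""
    start, stop, decay = temperature_decay
    # Decay multiplicatively from `start`, never dropping below `stop`.
    return max(start * decay**step, stop)


for step in (0, 100_000, 1_000_000):
    print(step, gumbel_temperature(step))
```

With the default tuple the temperature starts at 2.0 and has reached the floor of 0.25 well before one million updates.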
- class speechbrain.lobes.models.wav2vec.EncoderWrapper(in_dim, embedding_dim, latent_encoder, positional_encoding=<class 'speechbrain.lobes.models.transformer.Transformer.PositionalEncoding'>, dropout_encoder_input=0.05)
Bases: Module
A wrapper that adds positional information, masks the input and then runs the latent encoder.
- Parameters:
in_dim (int) – Last dimension of input tensor.
embedding_dim (int) – Dimension to project input to and that the latent encoder will use.
latent_encoder (torch.nn.Module) – Initialized latent encoder object.
positional_encoding (torch.nn.Module) – Uninitialized nn.Module for adding positional information; it will use embedding_dim.
dropout_encoder_input (float) – Dropout on encoder input.
Example
>>> import torch
>>> from speechbrain.lobes.models.transformer.Transformer import (
...     TransformerEncoder,
... )
>>> from speechbrain.lobes.models.wav2vec import EncoderWrapper
>>> encoder = TransformerEncoder(
...     d_model=768, num_layers=4, nhead=4, d_ffn=1024
... )
>>> wrapper = EncoderWrapper(1024, 768, encoder)
>>> inputs = torch.rand(10, 12, 1024)
>>> outputs = wrapper(inputs)
>>> outputs["embeddings"].shape
torch.Size([10, 12, 768])
- forward(latents, wav_lens=None, padding_mask=None, mask=None)
- Parameters:
latents (torch.Tensor, shape (B, T, C)) – Batch of latent representations (AKA frames) output from latent extractor.
wav_lens (torch.Tensor, shape (B,)) – The actual (unpadded) relative lengths for each sample of the batch (0 < wav_lens <= 1).
padding_mask (torch.Tensor, shape (B, T)) – Can be provided instead of wav_lens.
mask (torch.Tensor, shape (B, T)) – Boolean mask which decides which latent frames will be masked.
- Returns:
results – Has the following terms:
"num_masked" : number of masked terms
"ratio_masked" : ratio of masked terms
"embeddings" : features
- Return type:
dict
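The relationship between relative lengths, a padding mask and the masking statistics can be sketched in plain Python. This is an illustration under stated assumptions rather than the wrapper's implementation: it assumes that True in the padding mask marks padded frames and that the ratio is taken over all frames in the batch. Both helper names are hypothetical.

```python
def padding_mask_from_lens(wav_lens, num_frames):
    """True marks padded frames; wav_lens are relative lengths in (0, 1]."""
    masks = []
    for rel_len in wav_lens:
        valid = round(rel_len * num_frames)  # absolute number of real frames
        masks.append([t >= valid for t in range(num_frames)])
    return masks


def mask_stats(mask):
    """Count masked frames and their share of all frames in the batch."""
    num_masked = sum(sum(row) for row in mask)
    total = sum(len(row) for row in mask)
    return {"num_masked": num_masked, "ratio_masked": num_masked / total}


padding = padding_mask_from_lens([1.0, 0.5], num_frames=8)
mask = [
    [True, True, False, False, False, False, False, False],
    [False, True, True, False, False, False, False, False],
]
stats = mask_stats(mask)
print(padding[1])  # last four frames of the second sample are padding
print(stats)       # {'num_masked': 4, 'ratio_masked': 0.25}
```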
- speechbrain.lobes.models.wav2vec.compute_mask(shape, sample_lens, mask_prob, mask_length)
This creates the boolean mask for a target shape which respects the sample lengths and will have roughly mask_prob of its entries set to True.
- Parameters:
- Returns:
mask – Boolean mask with shape of input argument shape.
- Return type:
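The idea of span masking that respects sample lengths can be sketched as follows. This is a simplified illustration, not the library's algorithm: it assumes `sample_lens` holds absolute frame counts, it uses nested Python lists instead of arrays, and it lets overlapping spans merge, so the number of masked frames can fall below the target. The function name and the `seed` argument are hypothetical.

```python
import random


def sketch_compute_mask(shape, sample_lens, mask_prob, mask_length, seed=None):
    """Simplified span masking; sample_lens are absolute frame counts."""
    rng = random.Random(seed)
    batch_size, padded_len = shape
    # Use the shortest sample so every row gets the same number of spans.
    num_spans = max(1, int(mask_prob * min(sample_lens) / mask_length))
    mask = [[False] * padded_len for _ in range(batch_size)]
    for row, sample_len in zip(mask, sample_lens):
        # Start positions are chosen so that spans never reach the padding.
        starts = rng.sample(range(sample_len - mask_length + 1), num_spans)
        for start in starts:
            for t in range(start, start + mask_length):
                row[t] = True  # overlapping spans simply merge
    return mask


mask = sketch_compute_mask((2, 20), [20, 12], mask_prob=0.5, mask_length=3, seed=0)
for row in mask:
    print("".join("X" if m else "." for m in row))
```

Deriving the span count from the shortest sample keeps the number of spans equal across the batch, which matches the remark in `w2v_mask_collate_fn` that the shortest sample determines the number of masked frames.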
- speechbrain.lobes.models.wav2vec.sample_negatives(y, num_neg)
Samples negatives from target tensor y.
- Parameters:
y (torch.Tensor) – Tensor of shape (B, T, C)
num_neg (int) – Number of negatives to sample.
- Returns:
negs – Negatives in shape (N, B, T, C)
- Return type:
torch.Tensor
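A common way to sample negatives for a contrastive objective is to draw, for every target frame, other frames from the same utterance while excluding the target itself. The sketch below shows that idea with nested lists in place of tensors. Whether the library samples in exactly this way is an assumption, and `sketch_sample_negatives` is a hypothetical name.

```python
import random


def sketch_sample_negatives(y, num_neg, seed=None):
    """y is a nested list of shape (B, T, C); returns shape (N, B, T, C)."""
    rng = random.Random(seed)
    negs = []
    for _ in range(num_neg):
        per_batch = []
        for utterance in y:
            num_frames = len(utterance)
            frames = []
            for t in range(num_frames):
                # Draw from T - 1 candidates, then shift indices at or above t
                # by one so the target frame itself is never selected.
                idx = rng.randrange(num_frames - 1)
                if idx >= t:
                    idx += 1
                frames.append(utterance[idx])
            per_batch.append(frames)
        negs.append(per_batch)
    return negs


# Two utterances, four frames each, one feature per frame (the frame index).
y = [[[float(t)] for t in range(4)] for _ in range(2)]
negs = sketch_sample_negatives(y, num_neg=3, seed=0)
print(len(negs), len(negs[0]), len(negs[0][0]), len(negs[0][0][0]))  # 3 2 4 1
```

The shift by one keeps the draw uniform over the remaining T - 1 frames, which avoids a rejection loop.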
- speechbrain.lobes.models.wav2vec.w2v_mask_collate_fn(samples_lst, get_out_len_fn, mask_prob, mask_length)
This creates a batch from a list of samples and also creates the boolean mask that will be used to mask the inputs of the latent encoder. To create the mask we need to know the output shape after the latent extractor, hence the argument get_out_len_fn. One could also create masks per sample (when loading the audio file) and then collate them, but at that point one doesn't know the length of the shortest sample in the batch (which determines the number of masked frames), so it is better to create the mask here.
- Parameters:
samples_lst (list) – List of samples returned by the audio_pipeline.
get_out_len_fn (function) – Function that calculates length of sample after it passes through feature extractor.
mask_prob (float) – Approximate percentage of frames to mask.
mask_length (int) – Number of contiguous frames that will be masked.
- Returns:
wavs_padded (torch.Tensor, shape (B, T)) – Audio arrays with right-sided padding.
wav_lens (torch.Tensor, shape (B,)) – For each sample, the percentage of the array that is not padding.
mask (torch.Tensor, shape (B, T)) – Boolean mask to mask frames.
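The bookkeeping a collate function of this kind has to do can be sketched without any audio library. The example assumes each sample is a plain list of floats, whereas the real function receives the samples produced by the audio pipeline. `sketch_collate` and `toy_out_len` are hypothetical helpers, and mask generation itself is left out.

```python
def sketch_collate(samples, get_out_len_fn):
    """Pad raw samples and derive the lengths a mask generator would need."""
    max_len = max(len(s) for s in samples)
    # Right-pad every sample with zeros up to the longest one.
    wavs_padded = [list(s) + [0.0] * (max_len - len(s)) for s in samples]
    # Relative length: share of each padded row that is real audio.
    wav_lens = [len(s) / max_len for s in samples]
    # Frame counts after the feature extractor, per sample and for the batch.
    frame_lens = [get_out_len_fn(len(s)) for s in samples]
    mask_shape = (len(samples), get_out_len_fn(max_len))
    return wavs_padded, wav_lens, frame_lens, mask_shape


def toy_out_len(num_samples):
    # Hypothetical extractor: one unpadded conv, kernel 4 and stride 2.
    return (num_samples - 4) // 2 + 1


samples = [[0.1] * 10, [0.2] * 6]
wavs_padded, wav_lens, frame_lens, mask_shape = sketch_collate(samples, toy_out_len)
print(wav_lens)    # [1.0, 0.6]
print(frame_lens)  # [4, 2]
print(mask_shape)  # (2, 4)
```

The time dimension of the mask is counted in frames after the feature extractor, while the time dimension of the padded audio is counted in raw samples. That difference is why the length function is needed.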