speechbrain.lobes.models.beats module

This lobe enables the integration of pretrained BEATs (Audio Pre-Training with Acoustic Tokenizers) models.

Reference: https://arxiv.org/abs/2212.09058 Based on the GitHub source: https://github.com/microsoft/unilm/tree/master/beats

You can download the checkpoints from: https://github.com/microsoft/unilm/tree/master/beats

Author
  • Pooneh Mousavi 2024

Summary

Classes:

BEATs

BEATs: Audio Pre-Training with Acoustic Tokenizers.

BEATsConfig

Configuration class for the BEATs model.

GLU_Linear

Implements a Gated Linear Unit (GLU) combined with a linear transformation.

GradMultiply

A custom autograd function that scales gradients during the backward pass.

MultiheadAttention

Implements multi-headed attention with support for advanced features like relative position embeddings and gated relative position embedding (GRU-based).

SamePad

Implements a module that adjusts the padding of a tensor after convolution to maintain its original size, with an option for causal padding.

Swish

Implements the Swish activation function as a PyTorch module.

TransformerEncoder

Implements the Transformer Encoder module.

TransformerSentenceEncoderLayer

Implements a single Transformer Sentence Encoder layer.

Functions:

gelu

Applies the Gaussian Error Linear Unit (GELU) activation function.

gelu_accurate

Applies the Gaussian Error Linear Unit (GELU) activation function using an accurate approximation.

get_activation_fn

Returns the activation function corresponding to the provided activation name.

init_bert_params

Initializes weights and biases for modules in the BERT model.

quant_noise

Wraps modules and applies quantization noise to their weights for subsequent quantization using Iterative Product Quantization (iPQ).

Reference

class speechbrain.lobes.models.beats.BEATs(ckp_path: str = None, freeze: bool = True, output_all_hiddens: bool = False)[source]

Bases: Module

BEATs: Audio Pre-Training with Acoustic Tokenizers.

This class implements the BEATs model, which processes audio signals for feature extraction or downstream tasks. The model supports loading from a checkpoint, applying normalization, and optionally freezing parameters.

Parameters:
  • ckp_path (str, optional) – Path to the checkpoint file. If None, the model initializes without pre-trained weights. You can download the checkpoints from: https://github.com/microsoft/unilm/tree/master/beats

  • freeze (bool, optional (default: True)) – If True, the model parameters are frozen and the model is set to evaluation mode.

  • output_all_hiddens (bool, optional (default: False)) – If True, the forward function outputs hidden states from all transformer layers. For example, BEATs_iter3 has 12 transformer layers and the output is of shape (13, B, T, C), where a projection of the CNN output is added to the beginning. If False, the forward function outputs the hidden states only from the last transformer layer.

Example

>>> import torch
>>> from speechbrain.lobes.models.beats import BEATs
>>> audio = torch.randn(4, 10000)  # Batch of 4 audio signals
>>> length = torch.tensor([1.0, 0.5, 0.75, 1.0])
>>> model = BEATs()
>>> outputs = model.extract_features(audio, length)[0]
>>> outputs.shape
torch.Size([4, 24, 768])
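
A minimal loading sketch (the checkpoint filename below is a hypothetical placeholder; see the link above for the official checkpoints):

>>> # "BEATs_iter3.pt" is a hypothetical local checkpoint path
>>> model = BEATs(ckp_path="BEATs_iter3.pt", freeze=True)  # doctest: +SKIP
>>> features = model(audio, length)  # doctest: +SKIP
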
forward_padding_mask(features: Tensor, padding_mask: Tensor) → Tensor[source]

Adjusts the padding mask for the given features.

Parameters:
  • features (torch.Tensor) – Input features after patch embedding.

  • padding_mask (torch.Tensor) – Original padding mask for input signals.

Returns:

Adjusted padding mask.

Return type:

torch.Tensor

preprocess(source: Tensor, fbank_mean: float = 15.41663, fbank_std: float = 6.55582) → Tensor[source]

Preprocesses the input waveform by extracting filter banks and applying normalization.

Parameters:
  • source (torch.Tensor) – Input waveform signals.

  • fbank_mean (float, optional) – Mean value for filter bank normalization (default: 15.41663).

  • fbank_std (float, optional) – Standard deviation for filter bank normalization (default: 6.55582).

Returns:

Normalized filter banks.

Return type:

torch.Tensor
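
A minimal sketch of calling preprocess directly, assuming 16 kHz input and the 128 mel bins used by the upstream BEATs filter-bank extraction:

>>> import torch
>>> model = BEATs()
>>> wav = torch.randn(4, 16000)  # 1 second of 16 kHz audio
>>> model.preprocess(wav).shape[-1]  # number of mel bins (assumed 128)
128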

forward(wav: Tensor, wav_lens: Tensor | None = None, fbank_mean: float = 15.41663, fbank_std: float = 6.55582)[source]

Takes an input waveform and returns its corresponding BEATs encoding.

Parameters:
  • wav (torch.Tensor) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative length of the wav given in SpeechBrain format.

  • fbank_mean (float, optional) – Mean value for filter bank normalization (default: 15.41663).

  • fbank_std (float, optional) – Standard deviation for filter bank normalization (default: 6.55582).

Returns:

BEATs encoded features.

Return type:

torch.Tensor

extract_features(wav: Tensor, wav_lens: Tensor | None = None, fbank_mean: float = 15.41663, fbank_std: float = 6.55582) → Tensor[source]

Extracts features from the input waveform.

Parameters:
  • wav (torch.Tensor) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative length of the wav given in SpeechBrain format.

  • fbank_mean (float, optional) – Mean value for filter bank normalization (default: 15.41663).

  • fbank_std (float, optional) – Standard deviation for filter bank normalization (default: 6.55582).

Returns:

Extracted features from the BEATs model.

Return type:

torch.Tensor

speechbrain.lobes.models.beats.gelu_accurate(x)[source]

Applies the Gaussian Error Linear Unit (GELU) activation function using an accurate approximation.

Parameters:

x (torch.Tensor) – Input tensor on which to apply the GELU activation.

Returns:

Tensor with GELU activation applied element-wise.

Return type:

torch.Tensor
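
For reference, a minimal sketch of the tanh-based approximation this function is commonly implemented with (the exact formula and constants are an assumption about the implementation, not part of this API):

>>> import torch
>>> from speechbrain.lobes.models.beats import gelu_accurate
>>> x = torch.randn(10)
>>> y = gelu_accurate(x)  # ~ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x**3)))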

speechbrain.lobes.models.beats.gelu(x: Tensor) → Tensor[source]

Applies the Gaussian Error Linear Unit (GELU) activation function.

Parameters:

x (torch.Tensor) – Input tensor to apply the GELU activation.

Returns:

Tensor with GELU activation applied element-wise.

Return type:

torch.Tensor

speechbrain.lobes.models.beats.get_activation_fn(activation: str)[source]

Returns the activation function corresponding to the provided activation name.

Parameters:

activation (str) – Name of the activation function. Supported values:
  • "relu": Applies the ReLU activation.
  • "gelu": Applies the GELU activation.
  • "gelu_fast": Alias for gelu_accurate, with a deprecation warning.
  • "gelu_accurate": Applies the accurate GELU approximation.
  • "tanh": Applies the Tanh activation.
  • "linear": Applies the identity function.
  • "glu": Applies the identity function (GLU placeholder).

Returns:

The corresponding activation function to apply to input tensors.

Return type:

Callable[[torch.Tensor], torch.Tensor]

Raises:

RuntimeError – If the specified activation function is not supported.
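
A short usage sketch:

>>> import torch
>>> from speechbrain.lobes.models.beats import get_activation_fn
>>> act = get_activation_fn("relu")
>>> act(torch.tensor([-1.0, 2.0]))
tensor([0., 2.])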

class speechbrain.lobes.models.beats.SamePad(kernel_size, causal=False)[source]

Bases: Module

Implements a module that adjusts the padding of a tensor after convolution to maintain its original size, with an option for causal padding.

This is particularly useful for handling padding in convolutional layers where the kernel size or causality affects the output size.

Parameters:
  • kernel_size (int) – The size of the convolutional kernel.

  • causal (bool, optional (default=False)) – If True, applies causal padding by removing (kernel_size - 1) elements from the end of the tensor. If False, removes elements to center-align the padding, ensuring the output size matches the input size.

forward(x)[source]

Adjusts the padding of the input tensor x.

If self.remove > 0, the method slices the tensor along the last dimension to remove excess padding based on the kernel_size and causal settings.

Parameters:

x (torch.Tensor) – The input tensor to adjust padding for.

Returns:

The tensor with adjusted padding.

Return type:

torch.Tensor
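
A minimal sketch, assuming SamePad is paired with a convolution padded by kernel_size // 2 so that it trims the extra frame produced by an even kernel:

>>> import torch
>>> from speechbrain.lobes.models.beats import SamePad
>>> conv = torch.nn.Conv1d(8, 8, kernel_size=4, padding=2)
>>> trim = SamePad(kernel_size=4)
>>> trim(conv(torch.randn(2, 8, 100))).shape
torch.Size([2, 8, 100])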

class speechbrain.lobes.models.beats.Swish[source]

Bases: Module

Implements the Swish activation function as a PyTorch module.

Swish is a smooth, non-monotonic activation function defined as:

Swish(x) = x * sigmoid(x)

It is often used in deep learning for its ability to improve training performance in certain architectures.

forward(x)[source]

Applies the Swish activation function to the input tensor.

Parameters:

x (torch.Tensor) – The input tensor to which the Swish activation is applied.

Returns:

The input tensor after applying the Swish activation.

Return type:

torch.Tensor

class speechbrain.lobes.models.beats.GLU_Linear(input_dim, output_dim, glu_type='sigmoid', bias_in_glu=True)[source]

Bases: Module

Implements a Gated Linear Unit (GLU) combined with a linear transformation.

Parameters:
  • input_dim (int) – The dimensionality of the input features.

  • output_dim (int) – The dimensionality of the output features.

  • glu_type (str, optional (default="sigmoid")) – The type of activation function used for gating. Supported values: "sigmoid" (sigmoid activation), "swish" (Swish activation), "relu" (ReLU activation), and "gelu" (GELU activation).

  • bias_in_glu (bool, optional (default=True)) – Whether to include a bias term in the linear transformation.
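
A minimal usage sketch (assuming a 3-D input whose last dimension equals input_dim, as used inside the Transformer feed-forward block):

>>> import torch
>>> from speechbrain.lobes.models.beats import GLU_Linear
>>> glu = GLU_Linear(input_dim=768, output_dim=3072, glu_type="swish")
>>> glu(torch.randn(4, 100, 768)).shape
torch.Size([4, 100, 3072])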

class speechbrain.lobes.models.beats.GradMultiply(*args, **kwargs)[source]

Bases: Function

A custom autograd function that scales gradients during the backward pass.

This is useful for scenarios where gradient scaling is required without affecting the forward pass output. The forward pass returns the input as-is, while the backward pass scales the gradients by a specified factor.

static forward(ctx, x, scale)[source]

Performs the forward pass of the GradMultiply function.

Parameters:
  • ctx (torch.autograd.Function) – The context object to store information for the backward computation.

  • x (torch.Tensor) – The input tensor to be forwarded unchanged.

  • scale (float) – The factor by which the gradients will be scaled during the backward pass.

Returns:

A new tensor identical to the input tensor.

Return type:

torch.Tensor

static backward(ctx, grad)[source]

Performs the backward pass, scaling the gradients by the stored factor.

Parameters:
  • ctx (torch.autograd.Function) – The context object containing the stored scaling factor.

  • grad (torch.Tensor) – The gradient tensor from the subsequent layer.

Returns:

The scaled gradient tensor and None (for the scale input, which has no gradient).

Return type:

Tuple[torch.Tensor, None]
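
A minimal sketch showing that the forward pass is the identity while gradients are scaled:

>>> import torch
>>> from speechbrain.lobes.models.beats import GradMultiply
>>> x = torch.ones(3, requires_grad=True)
>>> y = GradMultiply.apply(x, 0.5)
>>> y.sum().backward()
>>> x.grad
tensor([0.5000, 0.5000, 0.5000])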

speechbrain.lobes.models.beats.quant_noise(module, p, block_size)[source]

Wraps modules and applies quantization noise to their weights for subsequent quantization using Iterative Product Quantization (iPQ).

This approach is described in the paper "Training with Quantization Noise for Extreme Model Compression." It introduces quantization noise during training to improve model robustness for extreme weight compression scenarios.

Parameters:
  • module (nn.Module) – The module to which quantization noise will be applied. Supported modules are Linear, Embedding, and Conv2d.

  • p (float) – The amount of quantization noise to apply. Typically a probability or scaling factor.

  • block_size (int) – The size of the blocks for subsequent quantization with iPQ.

Return type:

None
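
A minimal sketch (assuming the weight dimensions are divisible by block_size and that the module is modified in place, with noise applied only in training mode):

>>> import torch
>>> import torch.nn as nn
>>> from speechbrain.lobes.models.beats import quant_noise
>>> layer = nn.Linear(64, 64)  # both dimensions divisible by block_size
>>> _ = quant_noise(layer, p=0.1, block_size=8)
>>> layer(torch.randn(2, 64)).shape
torch.Size([2, 64])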

class speechbrain.lobes.models.beats.TransformerEncoder(args)[source]

Bases: Module

Implements the Transformer Encoder module.

Parameters:

args (Namespace or dict) – A collection of model hyperparameters and configurations.

forward(x, padding_mask=None, output_all_hiddens=None)[source]

Processes the input sequence through the Transformer Encoder layers.

Parameters:
  • x (torch.Tensor) – The input tensor of shape (seq_len, batch_size, embed_dim) containing the input embeddings.

  • padding_mask (torch.Tensor, optional) – A binary mask of shape (batch_size, seq_len) indicating which positions are padding and should be ignored in attention computations. Default is None.

  • output_all_hiddens (bool, optional) – If True, returns the hidden states from all encoder layers in addition to the final output. Default is None.

Returns:

  • The final output tensor of shape (seq_len, batch_size, embed_dim).

  • A list of hidden states from the encoder layers, populated when output_all_hiddens is True.

Return type:

Tuple[torch.Tensor, List[torch.Tensor]]

extract_features(x, padding_mask=None, output_all_hiddens=None)[source]

Extracts features from the input sequence using positional convolution, layer normalization, dropout, and a series of Transformer Encoder layers.

Parameters:
  • x (torch.Tensor) – The input tensor of shape (batch_size, seq_len, embed_dim) containing the input embeddings.

  • padding_mask (torch.Tensor, optional) – A binary mask of shape (batch_size, seq_len) indicating which positions are padding and should be ignored in computations. Default is None.

  • output_all_hiddens (bool, optional) – If True, collects and returns the hidden states from all encoder layers in addition to the final output. Default is None.

Returns:

  • The final output tensor of shape (batch_size, seq_len, embed_dim).

  • A list of hidden states from the encoder layers, populated when output_all_hiddens is True.

Return type:

Tuple[torch.Tensor, List[torch.Tensor]]

class speechbrain.lobes.models.beats.TransformerSentenceEncoderLayer(embedding_dim: float = 768, ffn_embedding_dim: float = 3072, num_attention_heads: float = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, activation_fn: str = 'relu', layer_norm_first: bool = False, deep_norm: bool = False, has_relative_attention_bias: bool = False, num_buckets: int = 0, max_distance: int = 0, rescale_init: bool = False, gru_rel_pos: bool = False, encoder_layers: int = 0)[source]

Bases: Module

Implements a single Transformer Sentence Encoder layer.

Parameters:
  • embedding_dim (float, optional (default=768)) – The dimensionality of input embeddings.

  • ffn_embedding_dim (float, optional (default=3072)) – The dimensionality of the feed-forward network’s hidden layer.

  • num_attention_heads (float, optional (default=8)) – The number of attention heads for self-attention.

  • dropout (float, optional (default=0.1)) – The dropout rate applied to the output of the feed-forward network and attention layers.

  • attention_dropout (float, optional (default=0.1)) – The dropout rate applied within the attention mechanism.

  • activation_dropout (float, optional (default=0.1)) – The dropout rate applied after the activation function in the feed-forward network.

  • activation_fn (str, optional (default="relu")) – The activation function used in the feed-forward network. Supported values include "relu" and "gelu".

  • layer_norm_first (bool, optional (default=False)) – If True, applies layer normalization before attention and feed-forward layers; otherwise, applies it afterward.

  • deep_norm (bool, optional (default=False)) – If True, uses deep normalization scaling for residual connections.

  • has_relative_attention_bias (bool, optional (default=False)) – If True, includes relative position bias in the attention mechanism.

  • num_buckets (int, optional (default=0)) – The number of buckets used for relative attention bias (if enabled).

  • max_distance (int, optional (default=0)) – The maximum distance for relative attention bias (if enabled).

  • rescale_init (bool, optional (default=False)) – If True, rescales parameter initialization for improved stability.

  • gru_rel_pos (bool, optional (default=False)) – If True, incorporates GRU-style relative position encoding.

  • encoder_layers (int, optional (default=0)) – The number of encoder layers in the Transformer.

forward(x: Tensor, self_attn_mask: Tensor = None, self_attn_padding_mask: Tensor = None, need_weights: bool = False, pos_bias=None)[source]

Processes the input tensor through the Transformer sentence encoder layer.

Parameters:
  • x (torch.Tensor) – Input tensor of shape (seq_len, batch_size, embed_dim).

  • self_attn_mask (torch.Tensor, optional) – Mask for the self-attention mechanism, typically used for causal or padding masking. Default is None.

  • self_attn_padding_mask (torch.Tensor, optional) – Padding mask of shape (batch_size, seq_len), indicating which tokens should be ignored in attention computations. Default is None.

  • need_weights (bool, optional (default=False)) – Whether to return attention weights. If True, attention weights are included in the output.

  • pos_bias (optional) – Positional bias for relative attention, if applicable. Default is None.

Returns:

  • x (torch.Tensor): The output tensor of shape (seq_len, batch_size, embed_dim) after applying the encoder layer.

  • attn (torch.Tensor, optional): The attention weights, returned when need_weights is True.

  • pos_bias (optional): The relative position bias, if applicable.

Return type:

Tuple[torch.Tensor, torch.Tensor, optional]
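
A minimal construction sketch (assuming the default arguments are mutually consistent and that the layer returns the output tensor together with the optional attention weights and position bias):

>>> import torch
>>> from speechbrain.lobes.models.beats import TransformerSentenceEncoderLayer
>>> layer = TransformerSentenceEncoderLayer(embedding_dim=768, ffn_embedding_dim=3072, num_attention_heads=8)
>>> x = torch.randn(50, 4, 768)  # (seq_len, batch_size, embed_dim)
>>> out, attn, pos_bias = layer(x)
>>> out.shape
torch.Size([50, 4, 768])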

class speechbrain.lobes.models.beats.MultiheadAttention(embed_dim, num_heads, kdim=None, vdim=None, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, self_attention=False, encoder_decoder_attention=False, q_noise=0.0, qn_block_size=8, has_relative_attention_bias=False, num_buckets=32, max_distance=128, gru_rel_pos=False, rescale_init=False)[source]

Bases: Module

Implements multi-headed attention with support for advanced features like relative position embeddings and gated relative position embedding (GRU-based).

Parameters:
  • embed_dim (int) – Total number of dimensions for input embeddings.

  • num_heads (int) – Number of attention heads.

  • kdim (int, optional) – Dimensionality of key embeddings. Defaults to embed_dim.

  • vdim (int, optional) – Dimensionality of value embeddings. Defaults to embed_dim.

  • dropout (float, optional) – Dropout probability for attention weights. Defaults to 0.0.

  • bias (bool, optional) – Whether to include a bias term in projections. Defaults to True.

  • add_bias_kv (bool, optional) – Whether to include bias for key and value projections. Defaults to False.

  • add_zero_attn (bool, optional) – Whether to include zero attention vectors. Defaults to False.

  • self_attention (bool, optional) – Whether the layer is for self-attention. Defaults to False.

  • encoder_decoder_attention (bool, optional) – Whether the layer is for encoder-decoder attention. Defaults to False.

  • q_noise (float, optional) – Noise level for quantization. Defaults to 0.0.

  • qn_block_size (int, optional) – Block size for quantization. Defaults to 8.

  • has_relative_attention_bias (bool, optional) – Whether to use relative position embeddings. Defaults to False.

  • num_buckets (int, optional) – Number of buckets for relative position embeddings. Defaults to 32.

  • max_distance (int, optional) – Maximum distance for relative position embeddings. Defaults to 128.

  • gru_rel_pos (bool, optional) – Whether to use gated relative position embeddings. Defaults to False.

  • rescale_init (bool, optional) – Whether to rescale the initialization of weights. Defaults to False.

reset_parameters()[source]

Initializes the weights for the projection layers and relative position embeddings.

compute_bias(query_length: int, key_length: int) → Tensor[source]

Computes relative position bias for attention scores.

Parameters:
  • query_length (int) – The length of the query sequence.

  • key_length (int) – The length of the key sequence.

Returns:

A tensor of shape (num_heads, query_length, key_length) containing the relative position bias values for each attention head.

Return type:

torch.Tensor

forward(query: Tensor, key: Tensor | None, value: Tensor | None, key_padding_mask: Tensor | None = None, incremental_state: Dict[str, Dict[str, Tensor | None]] | None = None, need_weights: bool = True, static_kv: bool = False, attn_mask: Tensor | None = None, before_softmax: bool = False, need_head_weights: bool = False, position_bias: Tensor | None = None) → Tuple[Tensor, Tensor | None, Tensor | None][source]

Forward pass for multi-head attention with support for relative position embeddings, caching, and optional dropout.

This method implements the core functionality of multi-head attention with optional features such as relative position bias, incremental decoding, and support for various masking options.

Parameters:
  • query (torch.Tensor) – Query tensor of shape (target_length, batch_size, embed_dim).

  • key (torch.Tensor, optional) – Key tensor of shape (source_length, batch_size, embed_dim). Defaults to None.

  • value (torch.Tensor, optional) – Value tensor of shape (source_length, batch_size, embed_dim). Defaults to None.

  • key_padding_mask (torch.Tensor, optional) – Mask to exclude padding keys, of shape (batch_size, source_length), where padding elements are indicated by 1s. Defaults to None.

  • incremental_state (dict, optional) – Stores cached key and value tensors for incremental decoding. Defaults to None.

  • need_weights (bool, optional) – If True, returns the attention weights. Defaults to True.

  • static_kv (bool, optional) – If True, the key and value tensors remain static for incremental decoding. Defaults to False.

  • attn_mask (torch.Tensor, optional) – Attention mask to prevent certain positions from attending, typically for causal attention. Shape: (target_length, source_length). Defaults to None.

  • before_softmax (bool, optional) – If True, returns raw attention scores before softmax. Defaults to False.

  • need_head_weights (bool, optional) – If True, returns attention weights for each head. Implies need_weights=True. Defaults to False.

  • position_bias (torch.Tensor, optional) – Precomputed position bias tensor. If None, it is computed during the forward pass.

Returns:

  • attn (torch.Tensor) – Attention output of shape (target_length, batch_size, embed_dim).

  • attn_weights (torch.Tensor, optional) – Attention weights of shape (batch_size, num_heads, target_length, source_length), averaged across heads if need_head_weights=False.

  • position_bias (torch.Tensor, optional) – Computed or passed relative position bias of shape (num_heads, target_length, source_length).
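
A minimal self-attention sketch (assuming query, key, and value are the same tensor and no masks are supplied):

>>> import torch
>>> from speechbrain.lobes.models.beats import MultiheadAttention
>>> mha = MultiheadAttention(embed_dim=768, num_heads=8, self_attention=True)
>>> x = torch.randn(50, 4, 768)  # (target_length, batch_size, embed_dim)
>>> attn, attn_weights, position_bias = mha(query=x, key=x, value=x)
>>> attn.shape
torch.Size([50, 4, 768])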

apply_bias(k, v, bsz, attn_mask=None, key_padding_mask=None)[source]

Applies bias_k and bias_v to the key and value tensors, updating the attention mask and key padding mask accordingly.

Parameters:
  • k (torch.Tensor) – Key tensor.

  • v (torch.Tensor) – Value tensor.

  • bsz (int) – Batch size.

  • attn_mask (torch.Tensor) – Attention mask.

  • key_padding_mask (torch.Tensor) – Key padding mask.

Returns:

Updated key, value, attention mask, and key padding mask.

Return type:

Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]

apply_sparse_mask(attn_weights, tgt_len: int, src_len: int, bsz: int)[source]

Applies a sparse mask to the attention weights.

Parameters:
  • attn_weights (torch.Tensor) – The attention weights tensor of shape (batch_size * num_heads, tgt_len, src_len).

  • tgt_len (int) – The target sequence length.

  • src_len (int) – The source sequence length.

  • bsz (int) – The batch size.

Returns:

The (potentially modified) attention weights tensor. By default, this is the same as the input tensor.

Return type:

torch.Tensor

speechbrain.lobes.models.beats.init_bert_params(module: Module) → None[source]

Initializes weights and biases for modules in the BERT model.

Parameters:

module (nn.Module) – The module to initialize. Can be one of nn.Linear, nn.Embedding, or MultiheadAttention.
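
A minimal sketch of applying the initializer recursively (the small Sequential below is a hypothetical stand-in for a real model):

>>> import torch.nn as nn
>>> from speechbrain.lobes.models.beats import init_bert_params
>>> net = nn.Sequential(nn.Linear(16, 16), nn.Embedding(100, 16))
>>> _ = net.apply(init_bert_params)  # re-initializes the Linear and Embedding weights in place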

class speechbrain.lobes.models.beats.BEATsConfig(cfg=None)[source]

Bases: object

Configuration class for the BEATs model.

This class defines the configuration for the BEATs model. It provides a default configuration that can be updated with custom settings via the update method.

Parameters:

cfg (dict, optional) – A dictionary containing custom configuration values. If provided, it will override the default settings.

update(cfg: dict)[source]

Updates the instance’s attributes with key-value pairs from a given configuration dictionary.

Parameters:

cfg (dict) – A dictionary containing the configuration values to update the instance with.
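
A minimal sketch of overriding a default value via update (encoder_layers is assumed here to be one of the default configuration keys):

>>> from speechbrain.lobes.models.beats import BEATsConfig
>>> cfg = BEATsConfig()
>>> cfg.update({"encoder_layers": 6})  # key name assumed from the upstream BEATs config
>>> cfg.encoder_layers
6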