speechbrain.lobes.models.beats module
This lobe enables the integration of pretrained BEATs: Audio Pre-Training with Acoustic Tokenizers.
Reference: https://arxiv.org/abs/2212.09058
Based on the GitHub source: https://github.com/microsoft/unilm/tree/master/beats
You can download the checkpoints from: https://github.com/microsoft/unilm/tree/master/beats
- Author
Pooneh Mousavi 2024
Summary
Classes:
- BEATs: Audio Pre-Training with Acoustic Tokenizers.
- BEATsConfig: Configuration class for the BEATs model.
- GLU_Linear: Implements a Gated Linear Unit (GLU) combined with a linear transformation.
- GradMultiply: A custom autograd function that scales gradients during the backward pass.
- MultiheadAttention: Implements multi-headed attention with support for advanced features like relative position embeddings and gated relative position embedding (GRU-based).
- SamePad: Implements a module that adjusts the padding of a tensor after convolution to maintain its original size, with an option for causal padding.
- Swish: Implements the Swish activation function as a PyTorch module.
- TransformerEncoder: Implements the Transformer Encoder module.
- TransformerSentenceEncoderLayer: Implements a single Transformer Sentence Encoder layer.
Functions:
- gelu: Applies the Gaussian Error Linear Unit (GELU) activation function.
- gelu_accurate: Applies the Gaussian Error Linear Unit (GELU) activation function using an accurate approximation.
- get_activation_fn: Returns the activation function corresponding to the provided activation name.
- init_bert_params: Initializes weights and biases for modules in the BERT model.
- quant_noise: Wraps modules and applies quantization noise to their weights for subsequent quantization using Iterative Product Quantization (iPQ).
Reference
- class speechbrain.lobes.models.beats.BEATs(ckp_path: str = None, freeze: bool = True, output_all_hiddens: bool = False)[source]
Bases:
Module
BEATs: Audio Pre-Training with Acoustic Tokenizers.
This class implements the BEATs model, which processes audio signals for feature extraction or downstream tasks. The model supports loading from a checkpoint, applying normalization, and optionally freezing parameters.
- Parameters:
ckp_path (str, optional) – Path to the checkpoint file. If None, the model initializes without pre-trained weights. You can download the checkpoints from: https://github.com/microsoft/unilm/tree/master/beats
freeze (bool, optional (default: True)) – If True, the model parameters are frozen and the model is set to evaluation mode.
output_all_hiddens (bool, optional (default: False)) – If True, the forward function outputs hidden states from all transformer layers. For example, BEATs_iter3 has 12 transformer layers and the output is of shape (13, B, T, C), where a projection of the CNN output is added to the beginning. If False, the forward function outputs the hidden states only from the last transformer layer.
Example
>>> audio = torch.randn(4, 10000)  # Batch of 4 audio signals
>>> length = torch.tensor([1.0, 0.5, 0.75, 1.0])
>>> model = BEATs()
>>> outputs = model.extract_features(audio, length)[0]
>>> outputs.shape
torch.Size([4, 24, 768])
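A further, hedged sketch (the checkpoint path below is a placeholder; download the file from the link above): loading pretrained weights and requesting the hidden states of every layer could look like this.
>>> model = BEATs(ckp_path="path/to/BEATs_iter3.pt", output_all_hiddens=True)
>>> all_hiddens = model(audio, length)
>>> # expected shape for a 12-layer checkpoint: (13, batch, frames, 768), per the parameter description above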
- forward_padding_mask(features: Tensor, padding_mask: Tensor) → Tensor[source]
Adjusts the padding mask for the given features.
- Parameters:
features (torch.Tensor) – Input features after patch embedding.
padding_mask (torch.Tensor) – Original padding mask for input signals.
- Returns:
Adjusted padding mask.
- Return type:
torch.Tensor
- preprocess(source: Tensor, fbank_mean: float = 15.41663, fbank_std: float = 6.55582) → Tensor[source]
Preprocesses the input waveform by extracting filter banks and applying normalization.
- Parameters:
source (torch.Tensor) – A batch of audio signals from which to extract filter banks.
fbank_mean (float, optional) – Mean value for filter bank normalization (default: 15.41663).
fbank_std (float, optional) – Standard deviation for filter bank normalization (default: 6.55582).
- Returns:
Normalized filter banks.
- Return type:
torch.Tensor
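A minimal usage sketch, reusing the model instance from the class-level example above:
>>> wav = torch.randn(2, 16000)     # batch of two 1-second signals at 16 kHz
>>> fbanks = model.preprocess(wav)  # normalized log-mel filter banks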
- forward(wav: Tensor, wav_lens: Tensor | None = None, fbank_mean: float = 15.41663, fbank_std: float = 6.55582)[source]
Takes an input waveform and returns its corresponding BEATs encoding.
- Parameters:
wav (torch.Tensor) – A batch of audio signals to transform to features.
wav_lens (torch.Tensor) – The relative length of the wav given in SpeechBrain format.
fbank_mean (float, optional) – Mean value for filter bank normalization (default: 15.41663).
fbank_std (float, optional) – Standard deviation for filter bank normalization (default: 6.55582).
- Returns:
BEATs encoded features.
- Return type:
torch.Tensor
- extract_features(wav: Tensor, wav_lens: Tensor | None = None, fbank_mean: float = 15.41663, fbank_std: float = 6.55582) → Tensor[source]
Extracts features from the input waveform.
- Parameters:
wav (torch.Tensor) – A batch of audio signals to transform to features.
wav_lens (torch.Tensor) – The relative length of the wav given in SpeechBrain format.
fbank_mean (float, optional) – Mean value for filter bank normalization (default: 15.41663).
fbank_std (float, optional) – Standard deviation for filter bank normalization (default: 6.55582).
- Returns:
Extracted features from the BEATs model.
- Return type:
torch.Tensor
- speechbrain.lobes.models.beats.gelu_accurate(x)[source]
Applies the Gaussian Error Linear Unit (GELU) activation function using an accurate approximation.
- Parameters:
x (torch.Tensor) – Input tensor on which to apply the GELU activation.
- Returns:
Tensor with GELU activation applied element-wise.
- Return type:
torch.Tensor
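For reference, a minimal sketch of the standard tanh-based GELU approximation this function is assumed to implement (it coincides with PyTorch's approximate="tanh" mode):
>>> import math, torch
>>> def gelu_accurate_sketch(x):
...     # illustrative re-implementation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
...     return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))
>>> x = torch.randn(4)
>>> torch.allclose(gelu_accurate_sketch(x), torch.nn.functional.gelu(x, approximate="tanh"))
True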
- speechbrain.lobes.models.beats.gelu(x: Tensor) → Tensor[source]
Applies the Gaussian Error Linear Unit (GELU) activation function.
- Parameters:
x (torch.Tensor) – Input tensor to apply the GELU activation.
- Returns:
Tensor with GELU activation applied element-wise.
- Return type:
torch.Tensor
- speechbrain.lobes.models.beats.get_activation_fn(activation: str)[source]
Returns the activation function corresponding to the provided activation name.
- Parameters:
activation (str) – Name of the activation function. Supported values:
- "relu": Applies ReLU activation.
- "gelu": Applies the GELU activation.
- "gelu_fast": Alias for gelu_accurate, with a deprecation warning.
- "gelu_accurate": Applies the accurate GELU activation.
- "tanh": Applies the Tanh activation.
- "linear": Applies the identity function.
- "glu": Applies the identity function (GLU placeholder).
- Returns:
The corresponding activation function to apply to input tensors.
- Return type:
Callable[[torch.Tensor], torch.Tensor]
- Raises:
RuntimeError – If the specified activation function is not supported.
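A short usage sketch:
>>> act = get_activation_fn("gelu")
>>> act(torch.randn(2, 3)).shape
torch.Size([2, 3])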
- class speechbrain.lobes.models.beats.SamePad(kernel_size, causal=False)[source]
Bases:
Module
Implements a module that adjusts the padding of a tensor after convolution to maintain its original size, with an option for causal padding.
This is particularly useful for handling padding in convolutional layers where the kernel size or causality affects the output size.
- Parameters:
kernel_size (int) – The size of the convolutional kernel.
causal (bool, optional (default=False)) – If True, applies causal padding by removing (kernel_size - 1) elements from the end of the tensor. If False, removes elements to center-align the padding, ensuring the output size matches the input size.
- forward(x)[source]
Adjusts the padding of the input tensor x. If self.remove > 0, the method slices the tensor along the last dimension to remove excess padding based on the kernel_size and causal settings.
- Parameters:
x (torch.Tensor) – The input tensor to adjust padding for.
- Returns:
The tensor with adjusted padding.
- Return type:
torch.Tensor
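A usage sketch of the conventional pairing with a symmetrically padded convolution; with an even kernel the convolution produces one extra frame, which SamePad is expected to trim so the output length matches the input length:
>>> import torch.nn as nn
>>> conv = nn.Conv1d(8, 8, kernel_size=4, padding=2)  # even kernel: output is one frame longer
>>> trim = SamePad(kernel_size=4)
>>> x = torch.randn(1, 8, 100)
>>> trim(conv(x)).shape
torch.Size([1, 8, 100])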
- class speechbrain.lobes.models.beats.Swish[source]
Bases:
Module
Implements the Swish activation function as a PyTorch module.
- Swish is a smooth, non-monotonic activation function defined as:
Swish(x) = x * sigmoid(x)
It is often used in deep learning for its ability to improve training performance in certain architectures.
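A quick check against the stated definition:
>>> swish = Swish()
>>> x = torch.randn(5)
>>> torch.allclose(swish(x), x * torch.sigmoid(x))
True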
- class speechbrain.lobes.models.beats.GLU_Linear(input_dim, output_dim, glu_type='sigmoid', bias_in_glu=True)[source]
Bases:
Module
Implements a Gated Linear Unit (GLU) combined with a linear transformation.
- Parameters:
input_dim (int) – The dimensionality of the input features.
output_dim (int) – The dimensionality of the output features.
glu_type (str, optional (default="sigmoid")) – The type of activation function used for gating. Supported values are:
- "sigmoid": Uses the sigmoid activation function.
- "swish": Uses the Swish activation function.
- "relu": Uses the ReLU activation function.
- "gelu": Uses the GELU activation function.
bias_in_glu (bool, optional (default=True)) – Whether to include a bias term in the linear transformation.
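A usage sketch, assuming the gated projection maps input_dim features to output_dim features as the parameter descriptions suggest:
>>> glu = GLU_Linear(input_dim=256, output_dim=512, glu_type="swish")
>>> x = torch.randn(4, 10, 256)
>>> glu(x).shape
torch.Size([4, 10, 512])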
- class speechbrain.lobes.models.beats.GradMultiply(*args, **kwargs)[source]
Bases:
Function
A custom autograd function that scales gradients during the backward pass.
This is useful for scenarios where gradient scaling is required without affecting the forward pass output. The forward pass returns the input as-is, while the backward pass scales the gradients by a specified factor.
- static forward(ctx, x, scale)[source]
Performs the forward pass of the GradMultiply function.
- Parameters:
ctx (torch.autograd.Function) – The context object to store information for the backward computation.
x (torch.Tensor) – The input tensor to be forwarded unchanged.
scale (float) – The factor by which the gradients will be scaled during the backward pass.
- Returns:
A new tensor identical to the input tensor.
- Return type:
torch.Tensor
- static backward(ctx, grad)[source]
Performs the backward pass, scaling the gradients by the stored factor.
- Parameters:
ctx (torch.autograd.Function) – The context object containing the stored scaling factor.
grad (torch.Tensor) – The gradient tensor from the subsequent layer.
- Returns:
The scaled gradient tensor and None (for the scale input, which has no gradient).
- Return type:
Tuple[torch.Tensor, None]
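Typical usage goes through the autograd Function interface: the forward pass is the identity, while gradients flowing back into x are scaled.
>>> x = torch.randn(3, requires_grad=True)
>>> y = GradMultiply.apply(x, 0.5)
>>> y.sum().backward()
>>> torch.allclose(x.grad, torch.full_like(x, 0.5))
True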
- speechbrain.lobes.models.beats.quant_noise(module, p, block_size)[source]
Wraps modules and applies quantization noise to their weights for subsequent quantization using Iterative Product Quantization (iPQ).
This approach is described in the paper: "Training with Quantization Noise for Extreme Model Compression." It introduces quantization noise during training to improve model robustness for extreme weight compression scenarios.
- Parameters:
module (nn.Module) – The module to which quantization noise will be applied. Supported modules are Linear, Embedding, and Conv2d.
p (float) – The amount of quantization noise to apply. Typically a probability or scaling factor.
block_size (int) – The size of the blocks for subsequent quantization with iPQ.
- Return type:
None
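A usage sketch; the noise hook is assumed to act on the wrapped module in place, and the weight dimensions are chosen to be divisible by block_size:
>>> import torch.nn as nn
>>> layer = nn.Linear(64, 64)
>>> _ = quant_noise(layer, p=0.1, block_size=8)   # return value ignored; noise applies only in training mode
>>> out = layer(torch.randn(2, 64))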
- class speechbrain.lobes.models.beats.TransformerEncoder(args)[source]
Bases:
Module
Implements the Transformer Encoder module.
- Parameters:
args (Namespace or dict) – A collection of model hyperparameters and configurations.
- forward(x, padding_mask=None, output_all_hiddens=None)[source]
Processes the input sequence through the Transformer Encoder layers.
- Parameters:
x (torch.Tensor) – The input tensor of shape (seq_len, batch_size, embed_dim) containing the input embeddings.
padding_mask (torch.Tensor, optional) – A binary mask of shape (batch_size, seq_len) indicating which positions are padding and should be ignored in attention computations. Default is None.
output_all_hiddens (bool, optional) – If True, returns the hidden states from all encoder layers in addition to the final output. Default is None.
- Returns:
The final output tensor of shape (seq_len, batch_size, embed_dim).
- Return type:
Tuple[torch.Tensor, List[torch.Tensor]]
- extract_features(x, padding_mask=None, output_all_hiddens=None)[source]
Extracts features from the input sequence using positional convolution, layer normalization, dropout, and a series of Transformer Encoder layers.
- Parameters:
x (torch.Tensor) – The input tensor of shape (batch_size, seq_len, embed_dim) containing the input embeddings.
padding_mask (torch.Tensor, optional) – A binary mask of shape (batch_size, seq_len) indicating which positions are padding and should be ignored in computations. Default is None.
output_all_hiddens (bool, optional) – If True, collects and returns the hidden states from all encoder layers in addition to the final output. Default is None.
- Returns:
The final output tensor of shape (batch_size, seq_len, embed_dim).
- Return type:
Tuple[torch.Tensor, List[torch.Tensor]]
- class speechbrain.lobes.models.beats.TransformerSentenceEncoderLayer(embedding_dim: float = 768, ffn_embedding_dim: float = 3072, num_attention_heads: float = 8, dropout: float = 0.1, attention_dropout: float = 0.1, activation_dropout: float = 0.1, activation_fn: str = 'relu', layer_norm_first: bool = False, deep_norm: bool = False, has_relative_attention_bias: bool = False, num_buckets: int = 0, max_distance: int = 0, rescale_init: bool = False, gru_rel_pos: bool = False, encoder_layers: int = 0)[source]
Bases:
Module
Implements a single Transformer Sentence Encoder layer.
- Parameters:
embedding_dim (float, optional (default=768)) – The dimensionality of input embeddings.
ffn_embedding_dim (float, optional (default=3072)) – The dimensionality of the feed-forward network's hidden layer.
num_attention_heads (float, optional (default=8)) – The number of attention heads for self-attention.
dropout (float, optional (default=0.1)) – The dropout rate applied to the output of the feed-forward network and attention layers.
attention_dropout (float, optional (default=0.1)) – The dropout rate applied within the attention mechanism.
activation_dropout (float, optional (default=0.1)) – The dropout rate applied after the activation function in the feed-forward network.
activation_fn (str, optional (default="relu")) – The activation function used in the feed-forward network. Supported values include "relu" and "gelu".
layer_norm_first (bool, optional (default=False)) – If True, applies layer normalization before attention and feed-forward layers; otherwise, applies it afterward.
deep_norm (bool, optional (default=False)) – If True, uses deep normalization scaling for residual connections.
has_relative_attention_bias (bool, optional (default=False)) – If True, includes relative position bias in the attention mechanism.
num_buckets (int, optional (default=0)) – The number of buckets used for relative attention bias (if enabled).
max_distance (int, optional (default=0)) – The maximum distance for relative attention bias (if enabled).
rescale_init (bool, optional (default=False)) – If True, rescales parameter initialization for improved stability.
gru_rel_pos (bool, optional (default=False)) – If True, incorporates GRU-style relative position encoding.
encoder_layers (int, optional (default=0)) – The number of encoder layers in the Transformer.
- forward(x: Tensor, self_attn_mask: Tensor = None, self_attn_padding_mask: Tensor = None, need_weights: bool = False, pos_bias=None)[source]
Processes the input tensor through the Transformer sentence encoder layer.
- Parameters:
x (torch.Tensor) – Input tensor of shape (seq_len, batch_size, embed_dim).
self_attn_mask (torch.Tensor, optional) – Mask for the self-attention mechanism, typically used for causal or padding masking. Default is None.
self_attn_padding_mask (torch.Tensor, optional) – Padding mask of shape (batch_size, seq_len), indicating which tokens should be ignored in attention computations. Default is None.
need_weights (bool, optional (default=False)) – Whether to return attention weights. If True, attention weights are included in the output.
pos_bias (optional) – Positional bias for relative attention, if applicable. Default is None.
- Returns:
x (torch.Tensor): The output tensor of shape (seq_len, batch_size, embed_dim) after applying the encoder layer.
- Return type:
Tuple[torch.Tensor, torch.Tensor, optional]
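A usage sketch of a single layer, assuming the three-element return described above:
>>> layer = TransformerSentenceEncoderLayer(
...     embedding_dim=768, ffn_embedding_dim=3072, num_attention_heads=8
... )
>>> x = torch.randn(50, 4, 768)    # (seq_len, batch_size, embed_dim)
>>> out, attn, pos_bias = layer(x)
>>> out.shape
torch.Size([50, 4, 768])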
- class speechbrain.lobes.models.beats.MultiheadAttention(embed_dim, num_heads, kdim=None, vdim=None, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, self_attention=False, encoder_decoder_attention=False, q_noise=0.0, qn_block_size=8, has_relative_attention_bias=False, num_buckets=32, max_distance=128, gru_rel_pos=False, rescale_init=False)[source]
Bases:
Module
Implements multi-headed attention with support for advanced features like relative position embeddings and gated relative position embedding (GRU-based).
- Parameters:
embed_dim (int) – Total number of dimensions for input embeddings.
num_heads (int) – Number of attention heads.
kdim (int, optional) – Dimensionality of key embeddings. Defaults to embed_dim.
vdim (int, optional) – Dimensionality of value embeddings. Defaults to embed_dim.
dropout (float, optional) – Dropout probability for attention weights. Defaults to 0.0.
bias (bool, optional) – Whether to include a bias term in projections. Defaults to True.
add_bias_kv (bool, optional) – Whether to include bias for key and value projections. Defaults to False.
add_zero_attn (bool, optional) – Whether to include zero attention vectors. Defaults to False.
self_attention (bool, optional) – Whether the layer is for self-attention. Defaults to False.
encoder_decoder_attention (bool, optional) – Whether the layer is for encoder-decoder attention. Defaults to False.
q_noise (float, optional) – Noise level for quantization. Defaults to 0.0.
qn_block_size (int, optional) – Block size for quantization. Defaults to 8.
has_relative_attention_bias (bool, optional) – Whether to use relative position embeddings. Defaults to False.
num_buckets (int, optional) – Number of buckets for relative position embeddings. Defaults to 32.
max_distance (int, optional) – Maximum distance for relative position embeddings. Defaults to 128.
gru_rel_pos (bool, optional) – Whether to use gated relative position embeddings. Defaults to False.
rescale_init (bool, optional) – Whether to rescale the initialization of weights. Defaults to False.
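A self-attention usage sketch; inputs follow the (length, batch_size, embed_dim) layout used throughout this module, and the three-element return matches the forward signature below:
>>> mha = MultiheadAttention(embed_dim=768, num_heads=8, self_attention=True)
>>> x = torch.randn(50, 4, 768)
>>> attn_out, attn_weights, pos_bias = mha(query=x, key=x, value=x)
>>> attn_out.shape
torch.Size([50, 4, 768])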
- reset_parameters()[source]
Initializes the weights for the projection layers and relative position embeddings.
- compute_bias(query_length: int, key_length: int) → Tensor[source]
Computes relative position bias for attention scores.
- forward(query: Tensor, key: Tensor | None, value: Tensor | None, key_padding_mask: Tensor | None = None, incremental_state: Dict[str, Dict[str, Tensor | None]] | None = None, need_weights: bool = True, static_kv: bool = False, attn_mask: Tensor | None = None, before_softmax: bool = False, need_head_weights: bool = False, position_bias: Tensor | None = None) → Tuple[Tensor, Tensor | None, Tensor | None][source]
Forward pass for multi-head attention with support for relative position embeddings, caching, and optional dropout.
This method implements the core functionality of multi-head attention with optional features such as relative position bias, incremental decoding, and support for various masking options.
- Parameters:
query (torch.Tensor) – Query tensor of shape (target_length, batch_size, embed_dim).
key (torch.Tensor, optional) – Key tensor of shape (source_length, batch_size, embed_dim). Defaults to None.
value (torch.Tensor, optional) – Value tensor of shape (source_length, batch_size, embed_dim). Defaults to None.
key_padding_mask (torch.Tensor, optional) – Mask to exclude padding keys, of shape (batch_size, source_length), where padding elements are indicated by 1s. Defaults to None.
incremental_state (dict, optional) – Stores cached key and value tensors for incremental decoding. Defaults to None.
need_weights (bool, optional) – If True, returns the attention weights. Defaults to True.
static_kv (bool, optional) – If True, the key and value tensors remain static for incremental decoding. Defaults to False.
attn_mask (torch.Tensor, optional) – Attention mask to prevent certain positions from attending, typically for causal attention. Shape: (target_length, source_length). Defaults to None.
before_softmax (bool, optional) – If True, returns raw attention scores before softmax. Defaults to False.
need_head_weights (bool, optional) – If True, returns attention weights for each head. Implies need_weights=True. Defaults to False.
position_bias (torch.Tensor, optional) – Precomputed position bias tensor. If None, it is computed during the forward pass.
- Returns:
attn (torch.Tensor) – Attention output of shape (target_length, batch_size, embed_dim).
attn_weights (torch.Tensor, optional) – Attention weights of shape (batch_size, num_heads, target_length, source_length), averaged across heads if need_head_weights=False.
position_bias (torch.Tensor, optional) – Computed or passed relative position bias of shape (num_heads, target_length, source_length).
- apply_bias(k, v, bsz, attn_mask=None, key_padding_mask=None)[source]
Applies bias_k and bias_v to the key and value tensors, updating the attention mask and key padding mask accordingly.
- Parameters:
k (torch.Tensor) – Key tensor.
v (torch.Tensor) – Value tensor.
bsz (int) – Batch size.
attn_mask (torch.Tensor) – Attention mask.
key_padding_mask (torch.Tensor) – Key padding mask.
- Returns:
Updated key, value, attention mask, and key padding mask.
- Return type:
Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]
- speechbrain.lobes.models.beats.init_bert_params(module: Module) → None[source]
Initializes weights and biases for modules in the BERT model.
- Parameters:
module (nn.Module) – The module to initialize. Can be one of nn.Linear, nn.Embedding, or MultiheadAttention.
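Typically applied recursively with torch.nn.Module.apply:
>>> import torch.nn as nn
>>> mlp = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 768))
>>> _ = mlp.apply(init_bert_params)   # re-initializes the Linear layers; unsupported modules are left unchanged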
- class speechbrain.lobes.models.beats.BEATsConfig(cfg=None)[source]
Bases:
object
Configuration class for the BEATs model.
This class defines the configuration for the BEATs model. It provides a default configuration that can be updated with custom settings via the update method.
- Parameters:
cfg (dict, optional) – A dictionary containing custom configuration values. If provided, it will override the default settings.
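A usage sketch; the override key below is illustrative and must correspond to a field present in the default configuration:
>>> cfg = BEATsConfig()
>>> cfg.update({"encoder_layers": 12})   # hypothetical override of a default field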