speechbrain.lobes.models.transformer.Conformer module

Conformer implementation.

Authors

  • Jianyuan Zhong 2020

  • Samuele Cornell 2021

  • Sylvain de Langen 2023

Summary

Classes:

ConformerDecoder

This class implements the Conformer decoder.

ConformerDecoderLayer

This is an implementation of a Conformer decoder layer.

ConformerEncoder

This class implements the Conformer encoder.

ConformerEncoderLayer

This is an implementation of a Conformer encoder layer.

ConformerEncoderLayerStreamingContext

Streaming metadata and state for a ConformerEncoderLayer.

ConformerEncoderStreamingContext

Streaming metadata and state for a ConformerEncoder.

ConvolutionModule

This is an implementation of the convolution module in the Conformer.

Reference

class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayerStreamingContext(mha_left_context_size: int, mha_left_context: Tensor | None = None, dcconv_left_context: Tensor | None = None)[source]

Bases: object

Streaming metadata and state for a ConformerEncoderLayer.

The multi-head attention and Dynamic Chunk Convolution require saving some left context, which gets inserted as left padding.

See ConvolutionModule documentation for further details.

mha_left_context_size: int

For this layer, specifies how many frames of inputs should be saved. Usually, the same value is used across all layers, but this can be modified.

mha_left_context: Tensor | None = None

Left context to insert at the left of the current chunk as input to the multi-head attention. It can be None (e.g. for the first chunk) or hold up to mha_left_context_size frames, because for the first few chunks not enough left context may be available for padding.

dcconv_left_context: Tensor | None = None

Left context to insert at the left of the convolution according to the Dynamic Chunk Convolution method.

Unlike mha_left_context, here the amount of frames to keep is fixed and inferred from the kernel size of the convolution module.
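
In practice these contexts are created by ConformerEncoderLayer.make_streaming_context rather than built by hand, but a minimal sketch of direct construction (all state starts out empty, per the field defaults above):

>>> ctx = ConformerEncoderLayerStreamingContext(mha_left_context_size=16)
>>> ctx.mha_left_context is None
True
>>> ctx.dcconv_left_context is None
True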

class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderStreamingContext(dynchunktrain_config: DynChunkTrainConfig, layers: List[ConformerEncoderLayerStreamingContext])[source]

Bases: object

Streaming metadata and state for a ConformerEncoder.

dynchunktrain_config: DynChunkTrainConfig

Dynamic Chunk Training configuration holding chunk size and context size information.

layers: List[ConformerEncoderLayerStreamingContext]

Streaming metadata and state for each layer of the encoder.
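
This object is normally produced by ConformerEncoder.make_streaming_context. A hedged sketch of manual construction, assuming DynChunkTrainConfig is the dataclass from speechbrain.utils.dynamic_chunk_training with chunk_size and left_context_size fields (verify against that module):

>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> cfg = DynChunkTrainConfig(chunk_size=16, left_context_size=1)  # assumed field names
>>> ctx = ConformerEncoderStreamingContext(
...     dynchunktrain_config=cfg,
...     layers=[ConformerEncoderLayerStreamingContext(mha_left_context_size=16)],
... )
>>> len(ctx.layers)
1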

class speechbrain.lobes.models.transformer.Conformer.ConvolutionModule(input_size, kernel_size=31, bias=True, activation=<class 'speechbrain.nnet.activations.Swish'>, dropout=0.0, causal=False, dilation=1)[source]

Bases: Module

This is an implementation of the convolution module in the Conformer.

Parameters:
  • input_size (int) – The expected size of the input embedding dimension.

  • kernel_size (int, optional) – Kernel size of non-bottleneck convolutional layer.

  • bias (bool, optional) – Whether to use bias in the non-bottleneck conv layer.

  • activation (torch.nn.Module) – Activation function used after non-bottleneck conv layer.

  • dropout (float, optional) – Dropout rate.

  • causal (bool, optional) – Whether the convolution should be causal or not.

  • dilation (int, optional) – Dilation factor for the non-bottleneck conv layer.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = ConvolutionModule(512, 3)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])
forward(x: Tensor, mask: Tensor | None = None, dynchunktrain_config: DynChunkTrainConfig | None = None)[source]

Applies the convolution to an input tensor x.

Parameters:
  • x (torch.Tensor) – Input tensor to the convolution module.

  • mask (torch.Tensor, optional) – Mask to be applied over the output of the convolution using masked_fill_, if specified.

  • dynchunktrain_config (DynChunkTrainConfig, optional) – If specified, makes the module support Dynamic Chunk Convolution (DCConv), as introduced in Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR. This masks future frames while preserving better accuracy than a fully causal convolution, at a small speed cost. It should only be used for training (or, if you know what you are doing, for masked evaluation at inference time); at inference time, the streaming forward functions should be used instead. A sketch of training-time usage follows below.

training: bool
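
A hedged sketch of the training-time usage enabled by dynchunktrain_config, assuming DynChunkTrainConfig lives in speechbrain.utils.dynamic_chunk_training and that chunk_size counts frames (both are assumptions to verify):

>>> import torch
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> x = torch.rand((8, 60, 512))
>>> net = ConvolutionModule(512, 3)
>>> cfg = DynChunkTrainConfig(chunk_size=16, left_context_size=1)  # assumed fields
>>> output = net(x, dynchunktrain_config=cfg)  # future frames masked chunk-wise
>>> output.shape
torch.Size([8, 60, 512])
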
class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayer(d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')[source]

Bases: Module

This is an implementation of a Conformer encoder layer.

Parameters:
  • d_model (int) – The expected size of the input embedding.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • nhead (int) – Number of attention heads.

  • kernel_size (int, optional) – Kernel size of the convolution module.

  • kdim (int, optional) – Dimension of the key.

  • vdim (int, optional) – Dimension of the value.

  • activation (torch.nn.Module) – Activation function used in each Conformer layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • dropout (float, optional) – Dropout for the encoder.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
forward(x, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None, dynchunktrain_config: DynChunkTrainConfig | None = None)[source]
Parameters:
  • x (torch.Tensor) – The input sequence to the encoder layer.

  • src_mask (torch.Tensor, optional) – The mask for the src sequence.

  • src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.

  • pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings.

  • dynchunktrain_config (Optional[DynChunkTrainConfig]) – Dynamic Chunk Training configuration object for streaming, specifically involved here to apply Dynamic Chunk Convolution to the convolution module.
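
Continuing the class example above, a hedged sketch of passing a key padding mask; it assumes the usual PyTorch convention where True marks padded positions to be ignored:

>>> lengths = torch.tensor([60, 45, 30, 60, 20, 60, 60, 10])
>>> src_key_padding_mask = torch.arange(60)[None, :] >= lengths[:, None]  # (batch, time), True = padding
>>> output = net(x, src_key_padding_mask=src_key_padding_mask, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])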

forward_streaming(x, context: ConformerEncoderLayerStreamingContext, pos_embs: Tensor | None = None)[source]

Streaming forward for a Conformer layer (typically for DynamicChunkTraining-trained models), to be used at inference time. Relies on a mutable context object, as initialized by make_streaming_context, which should be reused across chunks. Invoked by ConformerEncoder.forward_streaming.

Parameters:
  • x (torch.Tensor) – Input tensor for this layer. Batching is supported as long as you keep the context consistent.

  • context (ConformerEncoderLayerStreamingContext) – Mutable streaming context; the same object should be passed across calls.

  • pos_embs (torch.Tensor, optional) – Positional embeddings, if used.

make_streaming_context(mha_left_context_size: int)[source]

Creates a blank streaming context for this encoding layer.

Parameters:

mha_left_context_size (int) – How many left frames should be saved and used as left context for the current chunk when streaming.
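
A hedged sketch of chunk-by-chunk inference on a single layer, assuming regularMHA attention so that no positional embeddings need to be supplied (with the default RelPosMHAXL, matching pos_embs for each chunk plus its retained left context would also be required):

>>> import torch
>>> layer = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3, attention_type="regularMHA")
>>> ctx = layer.make_streaming_context(mha_left_context_size=16)
>>> chunks = torch.rand((1, 64, 512)).chunk(4, dim=1)  # four 16-frame chunks
>>> outputs = [layer.forward_streaming(chunk, context=ctx) for chunk in chunks]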

training: bool
class speechbrain.lobes.models.transformer.Conformer.ConformerEncoder(num_layers, d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')[source]

Bases: Module

This class implements the Conformer encoder.

Parameters:
  • num_layers (int) – Number of layers.

  • d_model (int) – Embedding dimension size.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • nhead (int) – Number of attention heads.

  • kernel_size (int, optional) – Kernel size of the convolution module.

  • kdim (int, optional) – Dimension of the key.

  • vdim (int, optional) – Dimension of the value.

  • activation (torch.nn.Module) – Activation function used in each Conformer layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • dropout (float, optional) – Dropout for the encoder.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_emb = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoder(1, 512, 512, 8)
>>> output, _ = net(x, pos_embs=pos_emb)
>>> output.shape
torch.Size([8, 60, 512])
forward(src, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None, dynchunktrain_config: DynChunkTrainConfig | None = None)[source]
Parameters:
  • src (torch.Tensor) – The sequence to the encoder layer.

  • src_mask (torch.Tensor, optional) – The mask for the src sequence.

  • src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.

  • pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings. If custom pos_embs are given, they need to have the shape (1, 2*S-1, E), where S is the sequence length and E is the embedding dimension.

  • dynchunktrain_config (Optional[DynChunkTrainConfig]) – Dynamic Chunk Training configuration object for streaming, specifically involved here to apply Dynamic Chunk Convolution to the convolution module.
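
Building on the class example above, a hedged sketch of a non-streaming forward pass with Dynamic Chunk Convolution enabled. The DynChunkTrainConfig import path and field names are assumptions; note that, per the parameter description above, the config only drives the convolution modules here, so a chunk-wise attention mask, if desired, still has to be passed via src_mask:

>>> import torch
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> x = torch.rand((8, 60, 512))
>>> pos_emb = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoder(1, 512, 512, 8)
>>> cfg = DynChunkTrainConfig(chunk_size=16, left_context_size=1)  # assumed fields
>>> output, _ = net(x, pos_embs=pos_emb, dynchunktrain_config=cfg)
>>> output.shape
torch.Size([8, 60, 512])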

forward_streaming(src: Tensor, context: ConformerEncoderStreamingContext, pos_embs: Tensor | None = None)[source]

Streaming forward for the Conformer encoder (typically for DynamicChunkTraining-trained models), to be used at inference time. Relies on a mutable context object, as initialized by make_streaming_context, which should be reused across chunks.

Parameters:
  • src (torch.Tensor) – Input tensor. Batching is supported as long as you keep the context consistent.

  • context (ConformerEncoderStreamingContext) – Mutable streaming context; the same object should be passed across calls.

  • pos_embs (torch.Tensor, optional) – Positional embeddings, if used.

make_streaming_context(dynchunktrain_config: DynChunkTrainConfig)[source]

Creates a blank streaming context for the encoder.

Parameters:
  • dynchunktrain_config (DynChunkTrainConfig) – Dynamic Chunk Training configuration object for streaming. Among other things, it determines how many left frames are saved and used as left context for each chunk; this value is replicated across all layers.

training: bool
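
A hedged sketch of streaming inference over an utterance split into fixed-size chunks, assuming regularMHA attention (so pos_embs can be omitted) and the assumed DynChunkTrainConfig fields; in that config, left_context_size is typically counted in chunks rather than frames, so check its documentation:

>>> import torch
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> net = ConformerEncoder(1, 512, 512, 8, attention_type="regularMHA")
>>> ctx = net.make_streaming_context(DynChunkTrainConfig(chunk_size=16, left_context_size=1))
>>> chunks = torch.rand((1, 48, 512)).chunk(3, dim=1)  # three 16-frame chunks
>>> outputs = [net.forward_streaming(chunk, context=ctx) for chunk in chunks]
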
class speechbrain.lobes.models.transformer.Conformer.ConformerDecoderLayer(d_model, d_ffn, nhead, kernel_size, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=True, attention_type='RelPosMHAXL')[source]

Bases: Module

This is an implementation of a Conformer decoder layer.

Parameters:
  • d_model (int) – The expected size of the input embedding.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • nhead (int) – Number of attention heads.

  • kernel_size (int, optional) – Kernel size of the convolution module.

  • kdim (int, optional) – Dimension of the key.

  • vdim (int, optional) – Dimension of the value.

  • activation (torch.nn.Module, optional) – Activation function used in each Conformer layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • dropout (float, optional) – Dropout for the decoder.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> tgt = torch.rand((8, 60, 512))
>>> memory = torch.rand((8, 60, 512))
>>> net = ConformerDecoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3, attention_type="regularMHA")
>>> output = net(tgt, memory)
>>> output[0].shape
torch.Size([8, 60, 512])
training: bool
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]
Parameters:
  • tgt (torch.Tensor) – The sequence to the decoder layer.

  • memory (torch.Tensor) – The sequence from the last layer of the encoder.

  • tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.

  • memory_mask (torch.Tensor, optional) – The mask for the memory sequence.

  • tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.

  • memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.

  • pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.

  • pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
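
A hedged sketch of supplying a causal tgt_mask, assuming the get_lookahead_mask helper from speechbrain.lobes.models.transformer.Transformer (an additive upper-triangular mask built from the padded batch) and regularMHA attention so no positional embeddings are needed:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Transformer import get_lookahead_mask
>>> tgt = torch.rand((8, 60, 512))
>>> memory = torch.rand((8, 60, 512))
>>> net = ConformerDecoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3, attention_type="regularMHA")
>>> tgt_mask = get_lookahead_mask(tgt)  # each position attends only to itself and the past
>>> output = net(tgt, memory, tgt_mask=tgt_mask)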

class speechbrain.lobes.models.transformer.Conformer.ConformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'speechbrain.nnet.activations.Swish'>, kernel_size=3, bias=True, causal=True, attention_type='RelPosMHAXL')[source]

Bases: Module

This class implements the Conformer decoder.

Parameters:
  • num_layers (int) – Number of layers.

  • nhead (int) – Number of attention heads.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • d_model (int) – Embedding dimension size.

  • kdim (int, optional) – Dimension for key.

  • vdim (int, optional) – Dimension for value.

  • dropout (float, optional) – Dropout rate.

  • activation (torch.nn.Module, optional) – Activation function used after non-bottleneck conv layer.

  • kernel_size (int, optional) – Kernel size of convolutional layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(1, 8, 1024, 512, attention_type="regularMHA")
>>> output, _, _ = net(tgt, src)
>>> output.shape
torch.Size([8, 60, 512])
training: bool
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]
Parameters:
  • tgt (torch.Tensor) – The sequence to the decoder layer.

  • memory (torch.Tensor) – The sequence from the last layer of the encoder.

  • tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.

  • memory_mask (torch.Tensor, optional) – The mask for the memory sequence.

  • tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.

  • memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.

  • pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.

  • pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
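
Extending the class example above, a hedged sketch of passing key padding masks, assuming the PyTorch convention where True marks padded positions to be ignored:

>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(1, 8, 1024, 512, attention_type="regularMHA")
>>> tgt_lengths = torch.tensor([60, 55, 50, 45, 40, 35, 30, 25])
>>> tgt_key_padding_mask = torch.arange(60)[None, :] >= tgt_lengths[:, None]  # (batch, time), True = padding
>>> output, _, _ = net(tgt, src, tgt_key_padding_mask=tgt_key_padding_mask)
>>> output.shape
torch.Size([8, 60, 512])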