speechbrain.lobes.models.transformer.Conformer module
Conformer implementation.
Authors
- Jianyuan Zhong 2020
- Samuele Cornell 2021
Summary
Classes:
- ConformerDecoder – This class implements the Transformer decoder.
- ConformerDecoderLayer – This is an implementation of the Conformer decoder layer.
- ConformerEncoder – This class implements the Conformer encoder.
- ConformerEncoderLayer – This is an implementation of the Conformer encoder layer.
- ConvolutionModule – This is an implementation of the convolution module in Conformer.
Reference
- class speechbrain.lobes.models.transformer.Conformer.ConvolutionModule(input_size, kernel_size=31, bias=True, activation=<class 'speechbrain.nnet.activations.Swish'>, dropout=0.0, causal=False, dilation=1)[source]
Bases:
Module
This is an implementation of convolution module in Conformer.
- Parameters:
input_size (int) – The expected size of the input embedding dimension.
kernel_size (int, optional) – Kernel size of non-bottleneck convolutional layer.
bias (bool, optional) – Whether to use bias in the non-bottleneck conv layer.
activation (torch.nn.Module) – Activation function used after non-bottleneck conv layer.
dropout (float, optional) – Dropout rate.
causal (bool, optional) – Whether the convolution should be causal or not.
dilation (int, optional) – Dilation factor for the non-bottleneck conv layer.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = ConvolutionModule(512, 3)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])
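For streaming-style setups the same module can be instantiated with causal=True. The following is a minimal sketch, not part of the original docstring, assuming the causal padding keeps the time dimension unchanged:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConvolutionModule
>>> x = torch.rand((8, 60, 512))
>>> # causal module: the convolution only looks at current and past frames
>>> net = ConvolutionModule(512, kernel_size=3, causal=True)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])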
- class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayer(d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')[source]
Bases:
Module
This is an implementation of Conformer encoder layer.
- Parameters:
d_model (int) – The expected size of the input embedding.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use bias in the convolution module.
dropout (float, optional) – Dropout rate for the encoder.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
- forward(x, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None)[source]
- Parameters:
x (torch.Tensor) – The sequence to the encoder layer.
src_mask (torch.Tensor, optional) – The mask for the src sequence.
src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.
pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings.
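The following is a minimal sketch of a forward call that also supplies src_key_padding_mask. The mask layout (booleans with True marking padded frames) is an assumption based on the usual PyTorch convention, not something stated in this docstring:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerEncoderLayer
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> # mark the last 10 frames of every utterance as padding (True = ignored)
>>> src_key_padding_mask = torch.zeros(8, 60, dtype=torch.bool)
>>> src_key_padding_mask[:, 50:] = True
>>> output = net(x, src_key_padding_mask=src_key_padding_mask, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])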
- class speechbrain.lobes.models.transformer.Conformer.ConformerEncoder(num_layers, d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')[source]
Bases:
Module
This class implements the Conformer encoder.
- Parameters:
num_layers (int) – Number of layers.
d_model (int) – Embedding dimension size.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use bias in the convolution module.
dropout (float, optional) – Dropout for the encoder.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_emb = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoder(1, 512, 512, 8)
>>> output, _ = net(x, pos_embs=pos_emb)
>>> output.shape
torch.Size([8, 60, 512])
- forward(src, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None)[source]
- Parameters:
src (torch.Tensor) – The sequence to the encoder layer.
src_mask (torch.Tensor, optional) – The mask for the src sequence.
src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.
pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings. If custom pos_embs are given, they need to have the shape (1, 2*S-1, E), where S is the sequence length and E is the embedding dimension.
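As a rough usage sketch (assumed, not taken from the docstring), a custom pos_embs tensor can be built directly with the (1, 2*S-1, E) shape described above and combined with a boolean key padding mask (True marking padded frames, following the PyTorch convention):

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerEncoder
>>> S, E = 60, 512
>>> src = torch.rand((8, S, E))
>>> pos_embs = torch.rand((1, 2*S-1, E))  # (1, 2*S-1, E) as required above
>>> net = ConformerEncoder(num_layers=2, d_model=E, d_ffn=512, nhead=8)
>>> src_key_padding_mask = torch.zeros(8, S, dtype=torch.bool)
>>> src_key_padding_mask[:, 50:] = True   # last 10 frames are padding
>>> output, _ = net(src, src_key_padding_mask=src_key_padding_mask, pos_embs=pos_embs)
>>> output.shape
torch.Size([8, 60, 512])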
- class speechbrain.lobes.models.transformer.Conformer.ConformerDecoderLayer(d_model, d_ffn, nhead, kernel_size, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=True, attention_type='RelPosMHAXL')[source]
Bases:
Module
This is an implementation of the Conformer decoder layer.
- Parameters:
d_model (int) – The expected size of the input embedding.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module, optional) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use bias in the convolution module.
dropout (float, optional) – Dropout for the decoder.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]
- Parameters:
tgt (torch.Tensor) – The sequence to the decoder layer.
memory (torch.Tensor) – The sequence from the last layer of the encoder.
tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.
memory_mask (torch.Tensor, optional) – The mask for the memory sequence.
tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.
memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.
pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.
pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
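Since the example above instantiates the encoder layer, here is a hedged sketch of calling the decoder layer itself on an encoder memory of a different length. The regularMHA attention type is chosen only to avoid constructing relative positional embeddings, and the returned tuple layout (output first) is assumed to match ConformerDecoder below:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerDecoderLayer
>>> tgt = torch.rand((8, 50, 512))     # decoder input
>>> memory = torch.rand((8, 60, 512))  # encoder output
>>> net = ConformerDecoderLayer(d_model=512, d_ffn=1024, nhead=8, kernel_size=3, attention_type="regularMHA")
>>> # mask padded encoder frames (True = ignored), following the PyTorch convention
>>> memory_key_padding_mask = torch.zeros(8, 60, dtype=torch.bool)
>>> memory_key_padding_mask[:, 55:] = True
>>> out = net(tgt, memory, memory_key_padding_mask=memory_key_padding_mask)
>>> out[0].shape
torch.Size([8, 50, 512])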
- class speechbrain.lobes.models.transformer.Conformer.ConformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'speechbrain.nnet.activations.Swish'>, kernel_size=3, bias=True, causal=True, attention_type='RelPosMHAXL')[source]
Bases:
Module
This class implements the Transformer decoder.
- Parameters:
num_layers (int) – Number of layers.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
d_model (int) – Embedding dimension size.
kdim (int, optional) – Dimension for key.
vdim (int, optional) – Dimension for value.
dropout (float, optional) – Dropout rate.
activation (torch.nn.Module, optional) – Activation function used after non-bottleneck conv layer.
kernel_size (int, optional) – Kernel size of convolutional layer.
bias (bool, optional) – Whether to use bias in the convolution module.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(1, 8, 1024, 512, attention_type="regularMHA")
>>> output, _, _ = net(tgt, src)
>>> output.shape
torch.Size([8, 60, 512])
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]
- Parameters:
tgt (torch.Tensor) – The sequence to the decoder layer.
memory (torch.Tensor) – The sequence from the last layer of the encoder.
tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.
memory_mask (torch.Tensor, optional) – The mask for the memory sequence.
tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.
memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.
pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.
pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
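A slightly fuller sketch of a decoder call with a padded encoder memory; again the boolean mask convention (True = padded position) is an assumption in line with torch.nn.MultiheadAttention rather than something this docstring specifies:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerDecoder
>>> tgt = torch.rand((8, 50, 512))
>>> memory = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(num_layers=2, nhead=8, d_ffn=1024, d_model=512, attention_type="regularMHA")
>>> memory_key_padding_mask = torch.zeros(8, 60, dtype=torch.bool)
>>> memory_key_padding_mask[:, 55:] = True  # last 5 encoder frames are padding
>>> output, _, _ = net(tgt, memory, memory_key_padding_mask=memory_key_padding_mask)
>>> output.shape
torch.Size([8, 50, 512])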