speechbrain.lobes.models.transformer.Conformer module
Conformer implementation.
Authors
- Jianyuan Zhong 2020
- Samuele Cornell 2021
Summary
Classes:
- ConformerDecoder – This class implements the Transformer decoder.
- ConformerDecoderLayer – This is an implementation of the Conformer decoder layer.
- ConformerEncoder – This class implements the Conformer encoder.
- ConformerEncoderLayer – This is an implementation of the Conformer encoder layer.
- ConvolutionModule – This is an implementation of the convolution module in Conformer.
Reference
- class speechbrain.lobes.models.transformer.Conformer.ConvolutionModule(input_size, kernel_size=31, bias=True, activation=<class 'speechbrain.nnet.activations.Swish'>, dropout=0.0, causal=False, dilation=1)[source]
Bases:
Module
This is an implementation of convolution module in Conformer.
- Parameters:
input_size (int) – The expected size of the input embedding dimension.
kernel_size (int, optional) – Kernel size of non-bottleneck convolutional layer.
bias (bool, optional) – Whether to use bias in the non-bottleneck conv layer.
activation (torch.nn.Module) – Activation function used after non-bottleneck conv layer.
dropout (float, optional) – Dropout rate.
causal (bool, optional) – Whether the convolution should be causal or not.
dilation (int, optional) – Dilation factor for the non-bottleneck conv layer.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = ConvolutionModule(512, 3)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])
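For streaming-style setups the same module can be instantiated with causal=True. The following is a minimal sketch, not part of the original docstring, assuming the causal padding keeps the time dimension unchanged:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConvolutionModule
>>> x = torch.rand((8, 60, 512))
>>> # causal module: the convolution only looks at current and past frames
>>> net = ConvolutionModule(512, kernel_size=3, causal=True)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])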
- class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayer(d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')[source]
Bases:
Module
This is an implementation of Conformer encoder layer.
- Parameters:
d_model (int) – The expected size of the input embedding.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use bias in the convolution module.
dropout (float, optional) – Dropout rate for the encoder.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
- forward(x, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None)[source]
- Parameters:
x (torch.Tensor) – The sequence to the encoder layer.
src_mask (torch.Tensor, optional) – The mask for the src sequence.
src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.
pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings.
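The following is a minimal sketch of a forward call that also supplies src_key_padding_mask. The mask layout (booleans with True marking padded frames) is an assumption based on the usual PyTorch convention, not something stated in this docstring:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerEncoderLayer
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> # mark the last 10 frames of every utterance as padding (True = ignored)
>>> src_key_padding_mask = torch.zeros(8, 60, dtype=torch.bool)
>>> src_key_padding_mask[:, 50:] = True
>>> output = net(x, src_key_padding_mask=src_key_padding_mask, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])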
- class speechbrain.lobes.models.transformer.Conformer.ConformerEncoder(num_layers, d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')[source]
Bases:
Module
This class implements the Conformer encoder.
- Parameters:
num_layers (int) – Number of layers.
d_model (int) – Embedding dimension size.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use bias in the convolution module.
dropout (float, optional) – Dropout for the encoder.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_emb = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoder(1, 512, 512, 8)
>>> output, _ = net(x, pos_embs=pos_emb)
>>> output.shape
torch.Size([8, 60, 512])
- forward(src, src_mask: Tensor | None = None, src_key_padding_mask: Tensor | None = None, pos_embs: Tensor | None = None)[source]
- Parameters:
src (torch.Tensor) – The sequence to the encoder layer.
src_mask (torch.Tensor, optional) – The mask for the src sequence.
src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.
pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings. If custom pos_embs are given, they need to have the shape (1, 2*S-1, E), where S is the sequence length and E is the embedding dimension.
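As a rough usage sketch (assumed, not taken from the docstring), a custom pos_embs tensor can be built directly with the (1, 2*S-1, E) shape described above and combined with a boolean key padding mask (True marking padded frames, following the PyTorch convention):

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerEncoder
>>> S, E = 60, 512
>>> src = torch.rand((8, S, E))
>>> pos_embs = torch.rand((1, 2*S-1, E))  # (1, 2*S-1, E) as required above
>>> net = ConformerEncoder(num_layers=2, d_model=E, d_ffn=512, nhead=8)
>>> src_key_padding_mask = torch.zeros(8, S, dtype=torch.bool)
>>> src_key_padding_mask[:, 50:] = True   # last 10 frames are padding
>>> output, _ = net(src, src_key_padding_mask=src_key_padding_mask, pos_embs=pos_embs)
>>> output.shape
torch.Size([8, 60, 512])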
- class speechbrain.lobes.models.transformer.Conformer.ConformerDecoderLayer(d_model, d_ffn, nhead, kernel_size, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=True, attention_type='RelPosMHAXL')[source]
Bases:
Module
This is an implementation of the Conformer decoder layer.
- Parameters:
d_model (int) – The expected size of the input embedding.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
kernel_size (int, optional) – Kernel size of the convolution module.
kdim (int, optional) – Dimension of the key.
vdim (int, optional) – Dimension of the value.
activation (torch.nn.Module, optional) – Activation function used in each Conformer layer.
bias (bool, optional) – Whether to use bias in the convolution module.
dropout (float, optional) – Dropout for the decoder.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]
- Parameters:
tgt (torch.Tensor) – The sequence to the decoder layer.
memory (torch.Tensor) – The sequence from the last layer of the encoder.
tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.
memory_mask (torch.Tensor, optional) – The mask for the memory sequence.
tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.
memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.
pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.
pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
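Since the example above instantiates the encoder layer, here is a hedged sketch of calling the decoder layer itself on an encoder memory of a different length. The regularMHA attention type is chosen only to avoid constructing relative positional embeddings, and the returned tuple layout (output first) is assumed to match ConformerDecoder below:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerDecoderLayer
>>> tgt = torch.rand((8, 50, 512))     # decoder input
>>> memory = torch.rand((8, 60, 512))  # encoder output
>>> net = ConformerDecoderLayer(d_model=512, d_ffn=1024, nhead=8, kernel_size=3, attention_type="regularMHA")
>>> # mask padded encoder frames (True = ignored), following the PyTorch convention
>>> memory_key_padding_mask = torch.zeros(8, 60, dtype=torch.bool)
>>> memory_key_padding_mask[:, 55:] = True
>>> out = net(tgt, memory, memory_key_padding_mask=memory_key_padding_mask)
>>> out[0].shape
torch.Size([8, 50, 512])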
- class speechbrain.lobes.models.transformer.Conformer.ConformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'speechbrain.nnet.activations.Swish'>, kernel_size=3, bias=True, causal=True, attention_type='RelPosMHAXL')[source]
Bases:
Module
This class implements the Transformer decoder.
- Parameters:
num_layers (int) – Number of layers.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
d_model (int) – Embedding dimension size.
kdim (int, optional) – Dimension for key.
vdim (int, optional) – Dimension for value.
dropout (float, optional) – Dropout rate.
activation (torch.nn.Module, optional) – Activation function used after non-bottleneck conv layer.
kernel_size (int, optional) – Kernel size of convolutional layer.
bias (bool, optional) – Whether to use bias in the convolution module.
causal (bool, optional) – Whether the convolutions should be causal or not.
attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.
Example
>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(1, 8, 1024, 512, attention_type="regularMHA")
>>> output, _, _ = net(tgt, src)
>>> output.shape
torch.Size([8, 60, 512])
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)[source]
- Parameters:
tgt (torch.Tensor) – The sequence to the decoder layer.
memory (torch.Tensor) – The sequence from the last layer of the encoder.
tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.
memory_mask (torch.Tensor, optional) – The mask for the memory sequence.
tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.
memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.
pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.
pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
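A slightly fuller sketch of a decoder call with a padded encoder memory; again the boolean mask convention (True = padded position) is an assumption in line with torch.nn.MultiheadAttention rather than something this docstring specifies:

>>> import torch
>>> from speechbrain.lobes.models.transformer.Conformer import ConformerDecoder
>>> tgt = torch.rand((8, 50, 512))
>>> memory = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(num_layers=2, nhead=8, d_ffn=1024, d_model=512, attention_type="regularMHA")
>>> memory_key_padding_mask = torch.zeros(8, 60, dtype=torch.bool)
>>> memory_key_padding_mask[:, 55:] = True  # last 5 encoder frames are padding
>>> output, _, _ = net(tgt, memory, memory_key_padding_mask=memory_key_padding_mask)
>>> output.shape
torch.Size([8, 50, 512])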