speechbrain.lobes.models.transformer.Conformer module

Conformer implementation.

Authors
  • Jianyuan Zhong 2020

  • Samuele Cornell 2021

Summary

Classes:

ConformerDecoder

This class implements the Conformer decoder.

ConformerDecoderLayer

This is an implementation of the Conformer decoder layer.

ConformerEncoder

This class implements the Conformer encoder.

ConformerEncoderLayer

This is an implementation of the Conformer encoder layer.

ConvolutionModule

This is an implementation of the convolution module in Conformer.

Reference

class speechbrain.lobes.models.transformer.Conformer.ConvolutionModule(input_size, kernel_size=31, bias=True, activation=<class 'speechbrain.nnet.activations.Swish'>, dropout=0.0, causal=False, dilation=1)

Bases: Module

This is an implementation of the convolution module in Conformer.

Parameters
  • input_size (int) – The expected size of the input embedding dimension.

  • kernel_size (int, optional) – Kernel size of non-bottleneck convolutional layer.

  • bias (bool, optional) – Whether to use bias in the non-bottleneck conv layer.

  • activation (torch.nn.Module) – Activation function used after non-bottleneck conv layer.

  • dropout (float, optional) – Dropout rate.

  • causal (bool, optional) – Whether the convolution should be causal or not.

  • dilation (int, optional) – Dilation factor for the non-bottleneck conv layer.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = ConvolutionModule(512, 3)
>>> output = net(x)
>>> output.shape
torch.Size([8, 60, 512])
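
The effect of the causal and dilation parameters can be sketched without any tensor library. The helper below is hypothetical and is not SpeechBrain's implementation: it assumes the usual definition of a length-preserving 1D convolution, where a causal layer pads only on the left so that output t never depends on inputs after t, and a non-causal layer pads both sides.

```python
def conv1d(signal, kernel, causal=False, dilation=1):
    """Length-preserving 1D convolution over a list of floats.

    Hypothetical helper for illustration only; assumes an odd kernel size
    in the non-causal case so the padding splits evenly.
    """
    span = (len(kernel) - 1) * dilation  # total padding needed to keep the length
    if causal:
        left, right = span, 0            # pad the past only
    else:
        left = span // 2                 # pad both sides
        right = span - left
    padded = [0.0] * left + list(signal) + [0.0] * right
    return [
        sum(w * padded[t + k * dilation] for k, w in enumerate(kernel))
        for t in range(len(signal))
    ]


# An impulse at t=3 shows which output positions can "see" that input.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]

non_causal = conv1d(impulse, kernel, causal=False)
causal = conv1d(impulse, kernel, causal=True)

print(non_causal)  # [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(causal)      # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
```

In the non-causal result, position 2 already responds to the impulse at position 3. In the causal result, nothing before position 3 is affected, and both outputs keep the input length, as in the example above.
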
forward(x, mask=None)

Processes the input tensor x and returns an output tensor.

training: bool
class speechbrain.lobes.models.transformer.Conformer.ConformerEncoderLayer(d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')

Bases: Module

This is an implementation of the Conformer encoder layer.

Parameters
  • d_model (int) – The expected size of the input embedding.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • nhead (int) – Number of attention heads.

  • kernel_size (int, optional) – Kernel size of the convolution module.

  • kdim (int, optional) – Dimension of the key.

  • vdim (int, optional) – Dimension of the value.

  • activation (torch.nn.Module) – Activation function used in each Conformer layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • dropout (float, optional) – Dropout for the encoder.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
forward(x, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None, pos_embs: Optional[Tensor] = None)
Parameters
  • x (torch.Tensor) – The sequence to the encoder layer.

  • src_mask (torch.Tensor, optional) – The mask for the src sequence.

  • src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.

  • pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings
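
The example for this class passes pos_embs with a middle dimension of 2*60-1 for a 60-frame input. One way to read that size, assuming Transformer-XL-style relative attention as the default attention_type='RelPosMHAXL' suggests, is one embedding per distinct query-key offset. The sketch below only counts those offsets; relative_offsets is a hypothetical helper, and the ordering SpeechBrain uses internally is not shown here.

```python
def relative_offsets(seq_len):
    """Return the sorted distinct offsets i - j between query i and key j."""
    return sorted({i - j for i in range(seq_len) for j in range(seq_len)})


seq_len = 60
d_model = 512

offsets = relative_offsets(seq_len)

# Offsets run from -(seq_len - 1) to seq_len - 1, giving 2 * seq_len - 1 values.
pos_embs_shape = (1, len(offsets), d_model)
print(offsets[0], offsets[-1], pos_embs_shape)  # -59 59 (1, 119, 512)
```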

training: bool
class speechbrain.lobes.models.transformer.Conformer.ConformerEncoder(num_layers, d_model, d_ffn, nhead, kernel_size=31, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=False, attention_type='RelPosMHAXL')

Bases: Module

This class implements the Conformer encoder.

Parameters
  • num_layers (int) – Number of layers.

  • d_model (int) – Embedding dimension size.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • nhead (int) – Number of attention heads.

  • kernel_size (int, optional) – Kernel size of the convolution module.

  • kdim (int, optional) – Dimension of the key.

  • vdim (int, optional) – Dimension of the value.

  • activation (torch.nn.Module) – Activation function used in each Conformer layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • dropout (float, optional) – Dropout for the encoder.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_emb = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoder(1, 512, 512, 8)
>>> output, _ = net(x, pos_embs=pos_emb)
>>> output.shape
torch.Size([8, 60, 512])
forward(src, src_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None, pos_embs: Optional[Tensor] = None)
Parameters
  • src (torch.Tensor) – The sequence to the encoder layer.

  • src_mask (torch.Tensor, optional) – The mask for the src sequence.

  • src_key_padding_mask (torch.Tensor, optional) – The mask for the src keys per batch.

  • pos_embs (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the input sequence positional embeddings
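
A key padding mask is typically derived from the true length of each sequence in a padded batch. The sketch below uses plain Python lists and a hypothetical helper, make_key_padding_mask. It assumes the convention used by PyTorch's nn.MultiheadAttention, where True marks a padded position to be ignored; check the convention of the attention layer you select before relying on it.

```python
def make_key_padding_mask(lengths, max_len=None):
    """Build a (batch, time) boolean mask; True marks padding (assumed convention)."""
    if max_len is None:
        max_len = max(lengths)
    return [[t >= n for t in range(max_len)] for n in lengths]


# Three sequences of lengths 4, 2 and 3, padded to a common length of 4.
mask = make_key_padding_mask([4, 2, 3])
for row in mask:
    print(row)
# [False, False, False, False]
# [False, False, True, True]
# [False, False, False, True]
```

With tensors, the same nested list could be wrapped in torch.tensor(mask) to obtain a boolean tensor of shape (batch, time).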

training: bool
class speechbrain.lobes.models.transformer.Conformer.ConformerDecoderLayer(d_model, d_ffn, nhead, kernel_size, kdim=None, vdim=None, activation=<class 'speechbrain.nnet.activations.Swish'>, bias=True, dropout=0.0, causal=True, attention_type='RelPosMHAXL')

Bases: Module

This is an implementation of the Conformer decoder layer.

Parameters
  • d_model (int) – The expected size of the input embedding.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • nhead (int) – Number of attention heads.

  • kernel_size (int) – Kernel size of the convolution module.

  • kdim (int, optional) – Dimension of the key.

  • vdim (int, optional) – Dimension of the value.

  • activation (torch.nn.Module, optional) – Activation function used in each Conformer layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • dropout (float, optional) – Dropout for the decoder.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> pos_embs = torch.rand((1, 2*60-1, 512))
>>> net = ConformerEncoderLayer(d_ffn=512, nhead=8, d_model=512, kernel_size=3)
>>> output = net(x, pos_embs=pos_embs)
>>> output[0].shape
torch.Size([8, 60, 512])
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)
Parameters
  • tgt (torch.Tensor) – The sequence to the decoder layer.

  • memory (torch.Tensor) – The sequence from the last layer of the encoder.

  • tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.

  • memory_mask (torch.Tensor, optional) – The mask for the memory sequence.

  • tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.

  • memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.

  • pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.

  • pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.
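
A common use of tgt_mask in autoregressive decoding is a look-ahead mask that stops target position i from attending to later positions. The sketch below is a minimal illustration with plain lists and a hypothetical helper, make_lookahead_mask. It assumes a boolean convention where True means "blocked"; some attention implementations expect a float mask with -inf at blocked positions instead.

```python
def make_lookahead_mask(size):
    """Build a (size, size) mask; row i is the query, column j the key.

    True marks a blocked (future) position under the assumed convention.
    """
    return [[j > i for j in range(size)] for i in range(size)]


tgt_mask = make_lookahead_mask(3)
for row in tgt_mask:
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]
```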

training: bool
class speechbrain.lobes.models.transformer.Conformer.ConformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.0, activation=<class 'speechbrain.nnet.activations.Swish'>, kernel_size=3, bias=True, causal=True, attention_type='RelPosMHAXL')

Bases: Module

This class implements the Conformer decoder.

Parameters
  • num_layers (int) – Number of layers.

  • nhead (int) – Number of attention heads.

  • d_ffn (int) – Hidden size of self-attention Feed Forward layer.

  • d_model (int) – Embedding dimension size.

  • kdim (int, optional) – Dimension for key.

  • vdim (int, optional) – Dimension for value.

  • dropout (float, optional) – Dropout rate.

  • activation (torch.nn.Module, optional) – Activation function used after non-bottleneck conv layer.

  • kernel_size (int, optional) – Kernel size of convolutional layer.

  • bias (bool, optional) – Whether to use bias in the convolution module.

  • causal (bool, optional) – Whether the convolutions should be causal or not.

  • attention_type (str, optional) – Type of attention layer, e.g. regularMHA for regular MultiHeadAttention.

Example

>>> import torch
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = ConformerDecoder(1, 8, 1024, 512, attention_type="regularMHA")
>>> output, _, _ = net(tgt, src)
>>> output.shape
torch.Size([8, 60, 512])
training: bool
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos_embs_tgt=None, pos_embs_src=None)
Parameters
  • tgt (torch.Tensor) – The sequence to the decoder layer.

  • memory (torch.Tensor) – The sequence from the last layer of the encoder.

  • tgt_mask (torch.Tensor, optional) – The mask for the tgt sequence.

  • memory_mask (torch.Tensor, optional) – The mask for the memory sequence.

  • tgt_key_padding_mask (torch.Tensor, optional) – The mask for the tgt keys per batch.

  • memory_key_padding_mask (torch.Tensor, optional) – The mask for the memory keys per batch.

  • pos_embs_tgt (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the target sequence positional embeddings for each attention layer.

  • pos_embs_src (torch.Tensor, torch.nn.Module, optional) – Module or tensor containing the source sequence positional embeddings for each attention layer.