speechbrain.lobes.models.transformer.Transformer module¶
Transformer implementation in the SpeechBrain style.
Authors: Jianyuan Zhong 2020
Summary¶
Classes:
NormalizedEmbedding – This class implements the normalized embedding layer for the transformer.
PositionalEncoding – This class implements the positional encoding function.
TransformerDecoder – This class implements the Transformer decoder.
TransformerDecoderLayer – This class implements the self-attention decoder layer.
TransformerEncoder – This class implements the transformer encoder.
TransformerEncoderLayer – This is an implementation of a self-attention encoder layer.
TransformerInterface – This is an interface for the transformer model.
Functions:
get_key_padding_mask – Creates a binary mask to prevent attention to padded locations.
get_lookahead_mask – Creates a look-ahead mask for each sequence, preventing attention to future positions.
Reference¶
-
class speechbrain.lobes.models.transformer.Transformer.TransformerInterface(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, custom_src_module=None, custom_tgt_module=None, positional_encoding=True, normalize_before=False, kernel_size: Optional[int] = 31, bias: Optional[bool] = True, encoder_module: Optional[str] = 'transformer', conformer_activation: Optional[torch.nn.modules.module.Module] = <class 'speechbrain.nnet.activations.Swish'>)[source]¶
Bases: torch.nn.modules.module.Module
This is an interface for the transformer model.
Users can modify the attributes and define the forward function as needed according to their own tasks.
The architecture is based on the paper “Attention Is All You Need”: https://arxiv.org/pdf/1706.03762.pdf
- Parameters
d_model (int) – The number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – The number of heads in the multi-head attention models (default=8).
num_encoder_layers (int) – The number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers (int) – The number of sub-decoder-layers in the decoder (default=6).
d_ffn (int) – The dimension of the feedforward network model (default=2048).
dropout (float) – The dropout value (default=0.1).
activation (torch class) – The activation function of encoder/decoder intermediate layer, e.g., relu or gelu (default=relu)
custom_src_module (torch class) – Module that processes the src features to expected feature dim.
custom_tgt_module (torch class) – Module that processes the tgt features to expected feature dim.
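Since users are expected to subclass this interface and define their own forward, here is a minimal, hypothetical sketch of that pattern. The attribute names self.encoder and self.decoder, and the number of values their forward calls return, are assumptions borrowed from the TransformerEncoder and TransformerDecoder examples below, not guarantees of this class.
>>> class MySeq2SeqModel(TransformerInterface):
...     # Hypothetical subclass: assumes the interface exposes self.encoder / self.decoder.
...     def forward(self, src, tgt):
...         enc_out, _ = self.encoder(src)              # encode padded source features
...         dec_out, _, _ = self.decoder(tgt, enc_out)  # decode conditioned on encoder output
...         return dec_out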
-
class speechbrain.lobes.models.transformer.Transformer.PositionalEncoding(input_size, max_len=2500)[source]¶
Bases: torch.nn.modules.module.Module
This class implements the positional encoding function.
PE(pos, 2i) = sin(pos / (10000^(2i/d_model)))
PE(pos, 2i+1) = cos(pos / (10000^(2i/d_model)))
- Parameters
input_size (int) – Embedding dimension (size of the last dimension of the input).
max_len (int) – Max length of the input sequences (default 2500).
Example
>>> a = torch.rand((8, 120, 512))
>>> enc = PositionalEncoding(input_size=a.shape[-1])
>>> b = enc(a)
>>> b.shape
torch.Size([1, 120, 512])
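To connect the example with the formula above, here is a plain-PyTorch sketch (not the class internals) of building the same sinusoidal table; the local names d_model, max_len, and two_i are introduced here for illustration only.
>>> import torch
>>> d_model, max_len = 512, 120
>>> pos = torch.arange(max_len).unsqueeze(1).float()   # positions, shape (max_len, 1)
>>> two_i = torch.arange(0, d_model, 2).float()        # the 2i term of the formula
>>> angle = pos / (10000 ** (two_i / d_model))
>>> pe = torch.zeros(max_len, d_model)
>>> pe[:, 0::2] = torch.sin(angle)                     # even dimensions
>>> pe[:, 1::2] = torch.cos(angle)                     # odd dimensions
>>> pe.unsqueeze(0).shape                              # same shape as the module output above
torch.Size([1, 120, 512])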
-
class speechbrain.lobes.models.transformer.Transformer.TransformerEncoderLayer(d_ffn, nhead, d_model=None, kdim=None, vdim=None, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False)[source]¶
Bases: torch.nn.modules.module.Module
This is an implementation of a self-attention encoder layer.
- Parameters
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
d_model (int) – The expected size of the input embedding.
reshape (bool) – Whether to automatically shape 4-d input to 3-d.
kdim (int) – Dimension of the key (Optional).
vdim (int) – Dimension of the value (Optional).
dropout (float) – Dropout for the encoder (Optional).
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = TransformerEncoderLayer(512, 8, d_model=512)
>>> output = net(x)
>>> output[0].shape
torch.Size([8, 60, 512])
-
forward(src, src_mask: Optional[torch.Tensor] = None, src_key_padding_mask: Optional[torch.Tensor] = None)[source]¶
- Parameters
src (tensor) – The sequence to the encoder layer (required).
src_mask (tensor) – The mask for the src sequence (optional).
src_key_padding_mask (tensor) – The mask for the src keys per batch (optional).
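As a hedged illustration of the optional mask arguments, the sketch below builds a key padding mask by hand (True marks padded frames, matching the convention shown by get_key_padding_mask later on this page) and passes it to the layer:
>>> import torch
>>> net = TransformerEncoderLayer(512, 8, d_model=512)
>>> src = torch.rand(8, 60, 512)
>>> key_padding = torch.zeros(8, 60, dtype=torch.bool)
>>> key_padding[:, 50:] = True   # pretend the last 10 frames of each sequence are padding
>>> out = net(src, src_key_padding_mask=key_padding)
>>> out[0].shape
torch.Size([8, 60, 512])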
-
class speechbrain.lobes.models.transformer.Transformer.TransformerEncoder(num_layers, nhead, d_ffn, input_shape=None, d_model=None, kdim=None, vdim=None, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False)[source]¶
Bases: torch.nn.modules.module.Module
This class implements the transformer encoder.
- Parameters
num_layers (int) – Number of transformer layers to include.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
input_shape (tuple) – Expected shape of an example input.
d_model (int) – The dimension of the input embedding.
kdim (int) – Dimension for key (Optional).
vdim (int) – Dimension for value (Optional).
dropout (float) – Dropout for the encoder (Optional).
input_module (torch class) – The module to process the source input feature to expected feature dimension (Optional).
Example
>>> import torch
>>> x = torch.rand((8, 60, 512))
>>> net = TransformerEncoder(1, 8, 512, d_model=512)
>>> output, _ = net(x)
>>> output.shape
torch.Size([8, 60, 512])
-
forward(src, src_mask: Optional[torch.Tensor] = None, src_key_padding_mask: Optional[torch.Tensor] = None)[source]¶
- Parameters
src (tensor) – The sequence to the encoder layer (required).
src_mask (tensor) – The mask for the src sequence (optional).
src_key_padding_mask (tensor) – The mask for the src keys per batch (optional).
-
class speechbrain.lobes.models.transformer.Transformer.TransformerDecoderLayer(d_ffn, nhead, d_model, kdim=None, vdim=None, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False)[source]¶
Bases: torch.nn.modules.module.Module
This class implements the self-attention decoder layer.
- Parameters
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
nhead (int) – Number of attention heads.
d_model (int) – The expected size of the input embedding.
kdim (int) – Dimension of the key (Optional).
vdim (int) – Dimension of the value (Optional).
dropout (float) – Dropout for the decoder (Optional).
Example
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = TransformerDecoderLayer(1024, 8, d_model=512)
>>> output, self_attn, multihead_attn = net(src, tgt)
>>> output.shape
torch.Size([8, 60, 512])
-
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]¶
- Parameters
tgt (tensor) – The sequence to the decoder layer (required).
memory (tensor) – The sequence from the last layer of the encoder (required).
tgt_mask (tensor) – The mask for the tgt sequence (optional).
memory_mask (tensor) – The mask for the memory sequence (optional).
tgt_key_padding_mask (tensor) – The mask for the tgt keys per batch (optional).
memory_key_padding_mask (tensor) – The mask for the memory keys per batch (optional).
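A hedged sketch of a causal decoding call: the tgt_mask comes from get_lookahead_mask (documented later on this page), which is assumed here to derive the mask size from the tensor's time dimension, and memory stands in for an encoder output of a different length.
>>> import torch
>>> layer = TransformerDecoderLayer(1024, 8, d_model=512)
>>> memory = torch.rand(8, 60, 512)   # stands in for the encoder output
>>> tgt = torch.rand(8, 20, 512)      # target-side features
>>> causal = get_lookahead_mask(tgt)  # blocks attention to future target steps
>>> out, self_attn, cross_attn = layer(tgt, memory, tgt_mask=causal)
>>> out.shape
torch.Size([8, 20, 512])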
-
class speechbrain.lobes.models.transformer.Transformer.TransformerDecoder(num_layers, nhead, d_ffn, d_model, kdim=None, vdim=None, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, normalize_before=False)[source]¶
Bases: torch.nn.modules.module.Module
This class implements the Transformer decoder.
- Parameters
num_layers (int) – Number of transformer layers to include.
nhead (int) – Number of attention heads.
d_ffn (int) – Hidden size of self-attention Feed Forward layer.
d_model (int) – The dimension of the input embedding.
kdim (int) – Dimension for key (Optional).
vdim (int) – Dimension for value (Optional).
dropout (float) – Dropout for the decoder (Optional).
Example
>>> src = torch.rand((8, 60, 512))
>>> tgt = torch.rand((8, 60, 512))
>>> net = TransformerDecoder(1, 8, 1024, d_model=512)
>>> output, _, _ = net(src, tgt)
>>> output.shape
torch.Size([8, 60, 512])
-
forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]¶
- Parameters
tgt (tensor) – The sequence to the decoder layer (required).
memory (tensor) – The sequence from the last layer of the encoder (required).
tgt_mask (tensor) – The mask for the tgt sequence (optional).
memory_mask (tensor) – The mask for the memory sequence (optional).
tgt_key_padding_mask (tensor) – The mask for the tgt keys per batch (optional).
memory_key_padding_mask (tensor) – The mask for the memory keys per batch (optional).
-
class speechbrain.lobes.models.transformer.Transformer.NormalizedEmbedding(d_model, vocab)[source]¶
Bases: torch.nn.modules.module.Module
This class implements the normalized embedding layer for the transformer.
Since the dot product of the self-attention is always normalized by sqrt(d_model) and the final linear projection for prediction shares weight with the embedding layer, we multiply the output of the embedding by sqrt(d_model).
- Parameters
d_model (int) – The number of expected features in the input (i.e. the embedding dimension).
vocab (int) – The vocabulary size.
Example
>>> emb = NormalizedEmbedding(512, 1000)
>>> trg = torch.randint(0, 999, (8, 50))
>>> emb_fea = emb(trg)
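To make the sqrt(d_model) scaling described above concrete, here is a plain-PyTorch sketch of the same idea; it is an illustration, not the internals of NormalizedEmbedding.
>>> import math
>>> import torch
>>> d_model, vocab = 512, 1000
>>> table = torch.nn.Embedding(vocab, d_model)
>>> tokens = torch.randint(0, vocab, (8, 50))
>>> scaled = table(tokens) * math.sqrt(d_model)   # embedding output scaled by sqrt(d_model)
>>> scaled.shape
torch.Size([8, 50, 512])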
-
speechbrain.lobes.models.transformer.Transformer.get_key_padding_mask(padded_input, pad_idx)[source]¶
Creates a binary mask to prevent attention to padded locations.
- Parameters
padded_input (tensor) – Padded input tensor.
pad_idx (int) – Index used for the padding element.
Example
>>> a = torch.LongTensor([[1,1,0], [2,3,0], [4,5,0]])
>>> get_key_padding_mask(a, pad_idx=0)
tensor([[False, False, True],
        [False, False, True],
        [False, False, True]])
-
speechbrain.lobes.models.transformer.Transformer.get_lookahead_mask(padded_input)[source]¶
Creates a look-ahead mask for each sequence, preventing attention to future positions (allowed positions are 0, masked positions are -inf, as in the example below).
- Parameters
padded_input (tensor) – Padded input tensor.
Example
>>> a = torch.LongTensor([[1,1,0], [2,3,0], [4,5,0]])
>>> get_lookahead_mask(a)
tensor([[0., -inf, -inf],
        [0., 0., -inf],
        [0., 0., 0.]])
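Putting the pieces of this module together, here is a rough end-to-end sketch with toy sizes. The wiring (adding the positional encoding to the inputs and feeding the encoder output to the decoder as memory) is an assumed usage pattern rather than a SpeechBrain recipe, and it assumes get_lookahead_mask accepts a feature tensor and reads its time dimension.
>>> import torch
>>> d_model = 64
>>> src = torch.rand(4, 30, d_model)              # e.g. acoustic features
>>> tgt_tokens = torch.randint(1, 100, (4, 12))   # target token ids, 0 reserved for padding
>>> tgt_tokens[:, -2:] = 0                        # fake some padding at the end
>>> emb = NormalizedEmbedding(d_model, 100)
>>> pos = PositionalEncoding(input_size=d_model)
>>> encoder = TransformerEncoder(2, 4, 128, d_model=d_model)
>>> decoder = TransformerDecoder(2, 4, 128, d_model=d_model)
>>> enc_out, _ = encoder(src + pos(src))
>>> tgt = emb(tgt_tokens)
>>> tgt = tgt + pos(tgt)
>>> dec_out, _, _ = decoder(tgt, enc_out,
...     tgt_mask=get_lookahead_mask(tgt),
...     tgt_key_padding_mask=get_key_padding_mask(tgt_tokens, pad_idx=0))
>>> dec_out.shape
torch.Size([4, 12, 64])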