speechbrain.lobes.models.transformer.TransformerST module
Transformer for ST in the SpeechBrain style.
Authors: YAO FEI, CHENG 2021
Summary
Classes:
TransformerST – This is an implementation of the transformer model for ST.
Reference
- class speechbrain.lobes.models.transformer.TransformerST.TransformerST(tgt_vocab, input_size, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, positional_encoding='fixed_abs_sine', normalize_before=False, kernel_size: int | None = 31, bias: bool | None = True, encoder_module: str | None = 'transformer', conformer_activation: torch.nn.modules.module.Module | None = <class 'speechbrain.nnet.activations.Swish'>, attention_type: str | None = 'regularMHA', max_length: int | None = 2500, causal: bool | None = True, ctc_weight: float = 0.0, asr_weight: float = 0.0, mt_weight: float = 0.0, asr_tgt_vocab: int = 0, mt_src_vocab: int = 0)[source]
Bases:
TransformerASR
This is an implementation of the transformer model for ST (speech translation).
The architecture is based on the paper "Attention Is All You Need": https://arxiv.org/pdf/1706.03762.pdf
- Parameters:
tgt_vocab (int) – Size of vocabulary.
input_size (int) – Input feature size.
d_model (int, optional) – Embedding dimension size (default=512).
nhead (int, optional) – The number of heads in the multi-head attention models (default=8).
num_encoder_layers (int, optional) – The number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers (int, optional) – The number of sub-decoder-layers in the decoder (default=6).
d_ffn (int, optional) – The dimension of the feedforward network model (default=2048).
dropout (float, optional) – The dropout value (default=0.1).
activation (torch.nn.Module, optional) – The activation function of FFN layers. Recommended: relu or gelu (default=relu).
positional_encoding (str, optional) – Type of positional encoding used, e.g. "fixed_abs_sine" for fixed absolute positional encodings.
normalize_before (bool, optional) – Whether normalization is applied before or after MHA or FFN in Transformer layers (default=False, per the signature above). Applying normalization before (pre-norm) was shown to lead to better performance and training stability.
kernel_size (int, optional) – Kernel size in convolutional layers when Conformer is used.
bias (bool, optional) – Whether to use bias in Conformer convolutional layers.
encoder_module (str, optional) – Choose between Conformer and Transformer for the encoder. The decoder is fixed to be a Transformer.
conformer_activation (torch.nn.Module, optional) – Activation module used after Conformer convolutional layers, e.g. Swish or ReLU. It has to be a torch Module.
attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
max_length (int, optional) – Max length for the target and source sequences in input. Used for positional encodings.
causal (bool, optional) – Whether the encoder should be causal or not (the decoder is always causal). If causal, the Conformer convolutional layer is causal.
ctc_weight (float) – The weight of the CTC loss for the ASR task.
asr_weight (float) – The weight of the ASR task in the loss computation.
mt_weight (float) – The weight of the MT task in the loss computation.
asr_tgt_vocab (int) – The vocabulary size of the ASR target language.
mt_src_vocab (int) – The vocabulary size of the MT source language.
Example
>>> src = torch.rand([8, 120, 512])
>>> tgt = torch.randint(0, 720, [8, 120])
>>> net = TransformerST(
...     720, 512, 512, 8, 1, 1, 1024, activation=torch.nn.GELU,
...     ctc_weight=1, asr_weight=0.3,
... )
>>> enc_out, dec_out = net.forward(src, tgt)
>>> enc_out.shape
torch.Size([8, 120, 512])
>>> dec_out.shape
torch.Size([8, 120, 512])
- forward_asr(encoder_out, src, tgt, wav_len, pad_idx=0)[source]
This method implements a decoding step for the ASR task.
- Parameters:
encoder_out (torch.Tensor) – The representation of the encoder (required).
src (torch.Tensor) – Input sequence (required).
tgt (torch.Tensor) – The sequence to the decoder (transcription) (required).
wav_len (torch.Tensor) – Length of input tensors (required).
pad_idx (int) – The index for <pad> token (default=0).
- Returns:
asr_decoder_out – One step of the ASR decoder.
- Return type:
torch.Tensor
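A minimal usage sketch follows. The shapes, the asr_tgt_vocab value, and the assumption that wav_len holds SpeechBrain-style relative lengths in [0, 1] are illustrative, not prescriptive:

>>> net = TransformerST(
...     720, 512, 512, 8, 1, 1, 1024, activation=torch.nn.GELU,
...     ctc_weight=0.5, asr_weight=0.3, asr_tgt_vocab=720,
... )
>>> src = torch.rand([8, 120, 512])                # speech features
>>> tgt = torch.randint(0, 720, [8, 120])          # translation tokens
>>> enc_out, dec_out = net.forward(src, tgt)
>>> wav_len = torch.ones([8])                      # relative lengths (assumed convention)
>>> transcription = torch.randint(0, 720, [8, 60]) # ASR target tokens
>>> asr_dec_out = net.forward_asr(enc_out, src, transcription, wav_len, pad_idx=0)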
- forward_mt(src, tgt, pad_idx=0)[source]
This method implements a forward step for the MT task.
- Parameters:
src (torch.Tensor) – The sequence to the encoder (transcription) (required).
tgt (torch.Tensor) – The sequence to the decoder (translation) (required).
pad_idx (int) – The index for <pad> token (default=0).
- Returns:
encoder_out (torch.Tensor) – Output of the encoder.
decoder_out (torch.Tensor) – Output of the decoder.
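A sketch of the MT path; the mt_src_vocab value and the token shapes below are assumptions for illustration:

>>> net_mt = TransformerST(
...     720, 512, 512, 8, 1, 1, 1024,
...     mt_weight=0.2, mt_src_vocab=500,
... )
>>> src_tokens = torch.randint(0, 500, [8, 50])    # source-language transcription tokens
>>> tgt_tokens = torch.randint(0, 720, [8, 60])    # target-language translation tokens
>>> enc_out_mt, dec_out_mt = net_mt.forward_mt(src_tokens, tgt_tokens, pad_idx=0)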
- forward_mt_decoder_only(src, tgt, pad_idx=0)[source]
This method implements a forward step for the MT task using a wav2vec encoder (same as above, but without the encoder stack).
- Parameters:
src (torch.Tensor) – Output features from the wav2vec 2.0 encoder (transcription).
tgt (torch.Tensor) – The sequence to the decoder (translation) (required).
pad_idx (int) – The index for <pad> token (default=0).
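A sketch, assuming src already holds wav2vec 2.0 features of dimension d_model, that the method returns the decoder output, and reusing a TransformerST instance constructed as above (shapes are illustrative):

>>> w2v_feats = torch.rand([8, 120, 512])          # assumed wav2vec 2.0 features
>>> tgt_tokens = torch.randint(0, 720, [8, 60])    # translation tokens
>>> dec_out = net.forward_mt_decoder_only(w2v_feats, tgt_tokens, pad_idx=0)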
- decode_asr(tgt, encoder_out)[source]
This method implements a decoding step for the transformer model.
- Parameters:
tgt (torch.Tensor) – The sequence to the decoder.
encoder_out (torch.Tensor) – Hidden output of the encoder.
- Returns:
prediction (torch.Tensor) – The predicted outputs.
multihead_attns (torch.Tensor) – The multi-head attention of the last decoding step.
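An illustrative single decoding step; the hypothesis tokens and shapes are assumptions, and net and enc_out come from the forward_asr sketch above:

>>> hyps = torch.randint(0, 720, [8, 5])           # running partial hypotheses
>>> prediction, attn = net.decode_asr(hyps, enc_out)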
- make_masks_for_mt(src, tgt, pad_idx=0)[source]
This method generates the masks for training the transformer model.
- Parameters:
src (torch.Tensor) – The sequence to the encoder (required).
tgt (torch.Tensor) – The sequence to the decoder (required).
pad_idx (int) – The index for <pad> token (default=0).
- Returns:
src_key_padding_mask (torch.Tensor) – Timesteps to mask due to padding.
tgt_key_padding_mask (torch.Tensor) – Timesteps to mask due to padding.
src_mask (torch.Tensor) – Timesteps to mask for causality.
tgt_mask (torch.Tensor) – Timesteps to mask for causality.
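A sketch of mask creation for MT token batches; vocabulary sizes and shapes are illustrative, and net_mt is the instance from the forward_mt sketch above:

>>> src_tokens = torch.randint(1, 500, [8, 50])    # 0 is reserved as <pad>
>>> tgt_tokens = torch.randint(1, 720, [8, 60])
>>> src_kpm, tgt_kpm, src_mask, tgt_mask = net_mt.make_masks_for_mt(
...     src_tokens, tgt_tokens, pad_idx=0
... )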