speechbrain.lobes.models.g2p.model module
The Attentional RNN model for Grapheme-to-Phoneme
- Authors
Mirco Ravanelli 2021
Artem Ploujnikov 2021
Summary
Classes:
AttentionSeq2Seq – The Attentional RNN encoder-decoder model
TransformerG2P – A Transformer-based Grapheme-to-Phoneme model
WordEmbeddingEncoder – A small encoder module that reduces the dimensionality and normalizes word embeddings
Functions:
Creates a dummy phoneme sequence
Computes the input dimension (intended for hparam files)
Reference
- class speechbrain.lobes.models.g2p.model.AttentionSeq2Seq(enc, encoder_emb, emb, dec, lin, out, bos_token=0, use_word_emb=False, word_emb_enc=None)[source]
Bases:
Module
The Attentional RNN encoder-decoder model
- Parameters:
enc (torch.nn.Module) – the encoder module
encoder_emb (torch.nn.Module) – the encoder embedding module
emb (torch.nn.Module) – the embedding module
dec (torch.nn.Module) – the decoder module
lin (torch.nn.Module) – the linear module
out (torch.nn.Module) – the output layer (typically log_softmax)
bos_token (int) – the index of the Beginning-of-Sentence token
use_word_emb (bool) – whether or not to use word embeddings
word_emb_enc (nn.Module) – a module to encode word embeddings
- forward(grapheme_encoded, phn_encoded=None, word_emb=None)[source]
Computes the forward pass
- Parameters:
grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor
phn_encoded (torch.Tensor) – the encoded phonemes
word_emb (torch.Tensor) – word embeddings (optional)
- Returns:
p_seq (torch.Tensor) – a (batch x position x token) tensor of token probabilities in each position
char_lens (torch.Tensor) – a tensor of character sequence lengths
encoder_out – the raw output of the encoder
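Example (a minimal sketch of wiring the model together and running a teacher-forced forward pass; the choice of sub-modules, the sizes, and the convention that grapheme_encoded/phn_encoded are (tensor, relative-length) pairs are illustrative assumptions, not values from a released recipe):

    import torch
    from speechbrain.nnet.embedding import Embedding
    from speechbrain.nnet.RNN import LSTM, AttentionalRNNDecoder
    from speechbrain.nnet.linear import Linear
    from speechbrain.nnet.activations import Softmax
    from speechbrain.lobes.models.g2p.model import AttentionSeq2Seq

    # Illustrative vocabulary and layer sizes (assumptions, not recipe values)
    n_graphemes, n_phonemes, emb_dim, hidden = 30, 45, 64, 128

    model = AttentionSeq2Seq(
        encoder_emb=Embedding(num_embeddings=n_graphemes, embedding_dim=emb_dim),
        enc=LSTM(hidden_size=hidden, input_size=emb_dim, bidirectional=True),
        emb=Embedding(num_embeddings=n_phonemes, embedding_dim=emb_dim),
        dec=AttentionalRNNDecoder(
            rnn_type="gru",
            attn_type="content",
            hidden_size=hidden,
            attn_dim=hidden,
            num_layers=1,
            enc_dim=2 * hidden,  # the bidirectional encoder doubles the output dim
            input_size=emb_dim,
        ),
        lin=Linear(n_neurons=n_phonemes, input_size=hidden),
        out=Softmax(apply_log=True),
    )

    chars = torch.randint(0, n_graphemes, (4, 10))   # (batch, char positions)
    char_lens = torch.ones(4)                        # relative lengths in [0, 1]
    phns = torch.randint(0, n_phonemes, (4, 12))     # BOS-prefixed phoneme targets
    phn_lens = torch.ones(4)

    # grapheme_encoded and phn_encoded are (tensor, relative lengths) pairs
    p_seq, char_lens, encoder_out, *rest = model(
        grapheme_encoded=(chars, char_lens),
        phn_encoded=(phns, phn_lens),
    )
    print(p_seq.shape)  # (batch, phoneme positions, n_phonemes)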
- class speechbrain.lobes.models.g2p.model.WordEmbeddingEncoder(word_emb_dim, word_emb_enc_dim, norm=None, norm_type=None)[source]
Bases:
Module
A small encoder module that reduces the dimensionality and normalizes word embeddings
- Parameters:
word_emb_dim (int) – the dimension of the original word embeddings
word_emb_enc_dim (int) – the dimension of the encoded word embeddings
norm (torch.nn.Module) – the normalization to be used (e.g. speechbrain.nnet.normalization.LayerNorm)
norm_type (str) – the type of normalization to be used
- forward(emb)[source]
Computes the forward pass of the embedding
- Parameters:
emb (torch.Tensor) – the original word embeddings
- Returns:
emb_enc – the encoded word embeddings
- Return type:
torch.Tensor
- NORMS = {'batch': <class 'speechbrain.nnet.normalization.BatchNorm1d'>, 'instance': <class 'speechbrain.nnet.normalization.InstanceNorm1d'>, 'layer': <class 'speechbrain.nnet.normalization.LayerNorm'>}
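Example (a brief usage sketch; the dimensions are illustrative assumptions, and norm_type selects one of the keys of NORMS above):

    import torch
    from speechbrain.lobes.models.g2p.model import WordEmbeddingEncoder

    # Project 300-dimensional word embeddings down to 128 dimensions,
    # selecting layer normalization via norm_type.
    word_emb_enc = WordEmbeddingEncoder(
        word_emb_dim=300,
        word_emb_enc_dim=128,
        norm_type="layer",
    )

    word_emb = torch.randn(4, 10, 300)  # (batch, words, embedding dim)
    emb_enc = word_emb_enc(word_emb)
    print(emb_enc.shape)                # torch.Size([4, 10, 128])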
- class speechbrain.lobes.models.g2p.model.TransformerG2P(emb, encoder_emb, char_lin, phn_lin, lin, out, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, custom_src_module=None, custom_tgt_module=None, positional_encoding='fixed_abs_sine', normalize_before=True, kernel_size=15, bias=True, encoder_module='transformer', attention_type='regularMHA', max_length=2500, causal=False, pad_idx=0, encoder_kdim=None, encoder_vdim=None, decoder_kdim=None, decoder_vdim=None, use_word_emb=False, word_emb_enc=None)[source]
Bases:
TransformerInterface
A Transformer-based Grapheme-to-Phoneme model
- Parameters:
emb (torch.nn.Module) – the embedding module
encoder_emb (torch.nn.Module) – the encoder embedding module
char_lin (torch.nn.Module) – a linear module connecting the inputs to the transformer
phn_lin (torch.nn.Module) – a linear module connecting the outputs to the transformer
out (torch.nn.Module) – the decoder module (usually Softmax)
lin (torch.nn.Module) – the linear module for outputs
d_model (int) – The number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – The number of heads in the multi-head attention models (default=8).
num_encoder_layers (int, optional) – The number of encoder layers in the encoder.
num_decoder_layers (int, optional) – The number of decoder layers in the decoder.
d_ffn (int, optional) – The dimension of the feedforward network model hidden layer.
dropout (float, optional) – The dropout value.
activation (torch.nn.Module, optional) – The activation function for the Feed-Forward Network layer, e.g., relu or gelu or swish.
custom_src_module (torch.nn.Module, optional) – Module that processes the src features to the expected feature dimension.
custom_tgt_module (torch.nn.Module, optional) – Module that processes the tgt features to the expected feature dimension.
positional_encoding (str, optional) – Type of positional encoding used, e.g. "fixed_abs_sine" for fixed absolute positional encodings.
normalize_before (bool, optional) – Whether normalization should be applied before or after MHA or FFN in Transformer layers. Defaults to True, as this was shown to lead to better performance and training stability.
kernel_size (int, optional) – Kernel size in convolutional layers when Conformer is used.
bias (bool, optional) – Whether to use bias in Conformer convolutional layers.
encoder_module (str, optional) – Choose between Conformer and Transformer for the encoder. The decoder is fixed to be a Transformer.
conformer_activation (torch.nn.Module, optional) – Activation module used after Conformer convolutional layers, e.g. Swish, ReLU. It has to be a torch Module.
attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
max_length (int, optional) – Max length for the target and source sequences in input. Used for positional encodings.
causal (bool, optional) – Whether the encoder should be causal or not (the decoder is always causal). If causal, the Conformer convolutional layer is causal.
pad_idx (int) – the padding index (for masks)
encoder_kdim (int, optional) – Dimension of the key for the encoder.
encoder_vdim (int, optional) – Dimension of the value for the encoder.
decoder_kdim (int, optional) – Dimension of the key for the decoder.
decoder_vdim (int, optional) – Dimension of the value for the decoder.
- forward(grapheme_encoded, phn_encoded=None, word_emb=None)[source]
Computes the forward pass
- Parameters:
grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor
phn_encoded (torch.Tensor) – the encoded phonemes
word_emb (torch.Tensor) – word embeddings (if applicable)
- Returns:
p_seq (torch.Tensor) – the log-probabilities of individual tokens in a sequence
char_lens (torch.Tensor) – a tensor of character sequence lengths
encoder_out (torch.Tensor) – the encoder state
attention (torch.Tensor) – the attention state
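Example (as with the RNN model, the sub-modules are supplied by the caller; the sketch below wires up a small model and runs a teacher-forced forward pass, where all module choices and sizes are illustrative assumptions rather than recipe values):

    import torch
    from speechbrain.nnet.embedding import Embedding
    from speechbrain.nnet.linear import Linear
    from speechbrain.nnet.activations import Softmax
    from speechbrain.lobes.models.g2p.model import TransformerG2P

    # Illustrative sizes (assumptions, not recipe values)
    n_graphemes, n_phonemes, emb_dim, d_model = 30, 45, 64, 128

    model = TransformerG2P(
        encoder_emb=Embedding(num_embeddings=n_graphemes, embedding_dim=emb_dim),
        emb=Embedding(num_embeddings=n_phonemes, embedding_dim=emb_dim),
        char_lin=Linear(n_neurons=d_model, input_size=emb_dim),
        phn_lin=Linear(n_neurons=d_model, input_size=emb_dim),
        lin=Linear(n_neurons=n_phonemes, input_size=d_model),
        out=Softmax(apply_log=True),
        d_model=d_model,
        nhead=4,
        num_encoder_layers=3,
        num_decoder_layers=3,
        d_ffn=256,
    )

    chars = torch.randint(0, n_graphemes, (4, 10))   # (batch, char positions)
    char_lens = torch.ones(4)                        # relative lengths in [0, 1]
    phns = torch.randint(0, n_phonemes, (4, 12))     # BOS-prefixed phoneme targets
    phn_lens = torch.ones(4)

    p_seq, char_lens, encoder_out, attention = model(
        grapheme_encoded=(chars, char_lens),
        phn_encoded=(phns, phn_lens),
    )
    print(p_seq.shape)  # (batch, phoneme positions, n_phonemes)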
- make_masks(src, tgt, src_len=None, pad_idx=0)[source]
This method generates the masks for training the transformer model.
- Parameters:
src (torch.Tensor) – The sequence to the encoder (required).
tgt (torch.Tensor) – The sequence to the decoder (required).
src_len (torch.Tensor) – Lengths corresponding to the src tensor.
pad_idx (int) – The index for the <pad> token (default=0).
- Returns:
src_key_padding_mask (torch.Tensor) – the source key padding mask
tgt_key_padding_mask (torch.Tensor) – the target key padding mask
src_mask (torch.Tensor) – the source mask
tgt_mask (torch.Tensor) – the target mask
- decode(tgt, encoder_out, enc_lens)[source]
This method implements a decoding step for the transformer model.
- Parameters:
tgt (torch.Tensor) – The sequence to the decoder.
encoder_out (torch.Tensor) – Hidden output of the encoder.
enc_lens (torch.Tensor) – The corresponding lengths of the encoder inputs.
- Returns:
prediction (torch.Tensor) – the predicted sequence
attention (torch.Tensor) – the attention matrix corresponding to the last attention head (useful for plotting attention)
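Example (in the SpeechBrain recipes, decoding is normally driven by a beam searcher from speechbrain.decoders; the loop below is only a hand-rolled greedy sketch that reuses model, chars, encoder_out, and char_lens from the TransformerG2P example above, assumes a BOS index of 0, and assumes the constructor arguments are stored as model.lin and model.out):

    import torch

    bos_index = 0    # assumption: the BOS token index used during training
    max_steps = 12   # fixed number of decoding steps, for illustration only

    tgt = torch.full((chars.size(0), 1), bos_index, dtype=torch.long)
    with torch.no_grad():
        for _ in range(max_steps):
            dec_out, attention = model.decode(tgt, encoder_out, char_lens)
            # Project the decoder output to phoneme log-probabilities and
            # greedily pick the most likely next token for each batch item.
            log_probs = model.out(model.lin(dec_out))
            next_token = log_probs[:, -1].argmax(dim=-1, keepdim=True)
            tgt = torch.cat([tgt, next_token], dim=1)

    print(tgt[:, 1:])  # greedy phoneme indices (EOS handling omitted)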