speechbrain.lobes.models.g2p.model module

The Attentional RNN model for Grapheme-to-Phoneme

Authors
  • Mirco Ravanelli 2021

  • Artem Ploujnikov 2021

Summary

Classes:

AttentionSeq2Seq

The Attentional RNN encoder-decoder model

TransformerG2P

A Transformer-based Grapheme-to-Phoneme model

WordEmbeddingEncoder

A small encoder module that reduces the dimensionality and normalizes word embeddings

Functions:

get_dummy_phonemes

Creates a dummy phoneme sequence

input_dim

Computes the input dimension (intended for hparam files)

Reference

class speechbrain.lobes.models.g2p.model.AttentionSeq2Seq(enc, encoder_emb, emb, dec, lin, out, bos_token=0, use_word_emb=False, word_emb_enc=None)[source]

Bases: Module

The Attentional RNN encoder-decoder model

Parameters:
  • enc (torch.nn.Module) – the encoder module

  • encoder_emb (torch.nn.Module) – the encoder embedding module

  • emb (torch.nn.Module) – the embedding module

  • dec (torch.nn.Module) – the decoder module

  • lin (torch.nn.Module) – the linear module

  • out (torch.nn.Module) – the output layer (typically log_softmax)

  • use_word_emb (bool) – whether or not to use word embedding

  • bos_token (int) – the index of the Beginning-of-Sentence token

  • word_emb_enc (nn.Module) – a module to encode word embeddings

Returns:

result – a (p_seq, char_lens) tuple

Return type:

tuple

forward(grapheme_encoded, phn_encoded=None, word_emb=None, **kwargs)[source]

Computes the forward pass

Parameters:
  • grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor

  • phn_encoded (torch.Tensor) – the encoded phonemes

  • word_emb (torch.Tensor) – word embeddings (optional)

Returns:

  • p_seq (torch.Tensor) – a (batch x position x token) tensor of token probabilities in each position

  • char_lens (torch.Tensor) – a tensor of character sequence lengths

  • encoder_out – the raw output of the encoder
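The (batch x position x token) layout of p_seq can be illustrated with a minimal sketch. The tensor below is random stand-in data rather than real model output, and the sizes (2 items, 5 positions, 43 tokens) are arbitrary assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

# Stand-in for the p_seq returned by forward(): log-probabilities laid out
# as (batch x position x token). The sizes here are illustrative only.
batch, positions, tokens = 2, 5, 43
p_seq = torch.randn(batch, positions, tokens).log_softmax(dim=-1)

# Greedy choice: the most likely token index at each position.
pred = p_seq.argmax(dim=-1)
print(pred.shape)  # (batch, position)
```

Greedy selection is only the simplest way to read the output; a beam search over the same distribution is a common alternative.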

training: bool

class speechbrain.lobes.models.g2p.model.WordEmbeddingEncoder(word_emb_dim, word_emb_enc_dim, norm=None, norm_type=None)[source]

Bases: Module

A small encoder module that reduces the dimensionality and normalizes word embeddings

Parameters:
  • word_emb_dim (int) – the dimension of the original word embeddings

  • word_emb_enc_dim (int) – the dimension of the encoded word embeddings

  • norm (torch.nn.Module) – the normalization to be used (e.g. speechbrain.nnet.normalization.LayerNorm)

  • norm_type (str) – the type of normalization to be used

forward(emb)[source]

Computes the forward pass of the embedding

Parameters:

emb (torch.Tensor) – the original word embeddings

Returns:

emb_enc – encoded word embeddings

Return type:

torch.Tensor
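The idea of reducing and normalizing word embeddings can be sketched in plain PyTorch. This is not the SpeechBrain implementation: the class name WordEmbeddingEncoderSketch, the order of operations, the fixed use of LayerNorm, and the Tanh activation are all assumptions made for illustration.

```python
import torch
from torch import nn


class WordEmbeddingEncoderSketch(nn.Module):
    """Hypothetical stand-in: normalize, project down, squash."""

    def __init__(self, word_emb_dim, word_emb_enc_dim):
        super().__init__()
        # Assumption: layer normalization over the embedding dimension.
        self.norm = nn.LayerNorm(word_emb_dim)
        # Linear projection that reduces the dimensionality.
        self.lin = nn.Linear(word_emb_dim, word_emb_enc_dim)
        # Assumption: a bounded activation on the encoded embeddings.
        self.activation = nn.Tanh()

    def forward(self, emb):
        return self.activation(self.lin(self.norm(emb)))


# A batch of 4 sequences, 10 positions each, 768-dimensional embeddings.
emb = torch.randn(4, 10, 768)
encoder = WordEmbeddingEncoderSketch(word_emb_dim=768, word_emb_enc_dim=256)
emb_enc = encoder(emb)
print(emb_enc.shape)
```

Only the last dimension changes, so the encoded embeddings stay aligned position by position with the original ones.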

NORMS = {'batch': <class 'speechbrain.nnet.normalization.BatchNorm1d'>, 'instance': <class 'speechbrain.nnet.normalization.InstanceNorm1d'>, 'layer': <class 'speechbrain.nnet.normalization.LayerNorm'>}
training: bool

class speechbrain.lobes.models.g2p.model.TransformerG2P(emb, encoder_emb, char_lin, phn_lin, lin, out, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, custom_src_module=None, custom_tgt_module=None, positional_encoding='fixed_abs_sine', normalize_before=True, kernel_size=15, bias=True, encoder_module='transformer', attention_type='regularMHA', max_length=2500, causal=False, pad_idx=0, encoder_kdim=None, encoder_vdim=None, decoder_kdim=None, decoder_vdim=None, use_word_emb=False, word_emb_enc=None)[source]

Bases: TransformerInterface

A Transformer-based Grapheme-to-Phoneme model

Parameters:
  • emb (torch.nn.Module) – the embedding module

  • encoder_emb (torch.nn.Module) – the encoder embedding module

  • char_lin (torch.nn.Module) – a linear module connecting the inputs to the transformer

  • phn_lin (torch.nn.Module) – a linear module connecting the outputs to the transformer

  • out (torch.nn.Module) – the decoder module (usually Softmax)

  • lin (torch.nn.Module) – the linear module for outputs

  • d_model (int) – The number of expected features in the encoder/decoder inputs (default=512).

  • nhead (int) – The number of heads in the multi-head attention models (default=8).

  • num_encoder_layers (int, optional) – The number of encoder layers in the encoder.

  • num_decoder_layers (int, optional) – The number of decoder layers in the decoder.

  • d_ffn (int, optional) – The dimension of the feedforward network model hidden layer.

  • dropout (float, optional) – The dropout value.

  • activation (torch.nn.Module, optional) – The activation function for the feed-forward network layer, e.g., relu, gelu, or swish.

  • custom_src_module (torch.nn.Module, optional) – Module that processes the src features to expected feature dim.

  • custom_tgt_module (torch.nn.Module, optional) – Module that processes the tgt features to expected feature dim.

  • positional_encoding (str, optional) – Type of positional encoding used. e.g. ‘fixed_abs_sine’ for fixed absolute positional encodings.

  • normalize_before (bool, optional) – Whether normalization should be applied before or after MHA or FFN in Transformer layers. Defaults to True as this was shown to lead to better performance and training stability.

  • kernel_size (int, optional) – Kernel size in convolutional layers when Conformer is used.

  • bias (bool, optional) – Whether to use bias in Conformer convolutional layers.

  • encoder_module (str, optional) – Choose between Conformer and Transformer for the encoder. The decoder is fixed to be a Transformer.

  • conformer_activation (torch.nn.Module, optional) – Activation module used after Conformer convolutional layers, e.g. Swish or ReLU. It must be a torch Module.

  • attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers. e.g. regularMHA or RelPosMHA.

  • max_length (int, optional) – Max length for the target and source sequence in input. Used for positional encodings.

  • causal (bool, optional) – Whether the encoder should be causal or not (the decoder is always causal). If causal the Conformer convolutional layer is causal.

  • pad_idx (int) – the padding index (for masks)

  • encoder_kdim (int, optional) – Dimension of the key for the encoder.

  • encoder_vdim (int, optional) – Dimension of the value for the encoder.

  • decoder_kdim (int, optional) – Dimension of the key for the decoder.

  • decoder_vdim (int, optional) – Dimension of the value for the decoder.
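The positional_encoding and max_length parameters above can be illustrated with a small sketch. It assumes that 'fixed_abs_sine' refers to the standard fixed sinusoidal scheme; the helper name sinusoidal_encoding is hypothetical and the code is written in plain Python for readability, not taken from the library.

```python
import math


def sinusoidal_encoding(max_length, d_model):
    """Build a (max_length x d_model) table of fixed sinusoidal encodings."""
    pe = [[0.0] * d_model for _ in range(max_length)]
    for pos in range(max_length):
        for i in range(0, d_model, 2):
            # Each pair of dimensions shares one wavelength.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Tiny table for illustration; the defaults above are far larger.
pe = sinusoidal_encoding(max_length=4, d_model=8)
print(len(pe), len(pe[0]))
```

Because the table is computed up front for max_length positions, sequences longer than max_length would have no encoding available in a scheme like this one.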

forward(grapheme_encoded, phn_encoded=None, word_emb=None, **kwargs)[source]

Computes the forward pass

Parameters:
  • grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor

  • phn_encoded (torch.Tensor) – the encoded phonemes

  • word_emb (torch.Tensor) – word embeddings (if applicable)

Returns:

  • p_seq (torch.Tensor) – the log-probabilities of individual tokens in a sequence

  • char_lens (torch.Tensor) – the character sequence lengths

  • encoder_out (torch.Tensor) – the encoder state

  • attention (torch.Tensor) – the attention state

make_masks(src, tgt, src_len=None, pad_idx=0)[source]

This method generates the masks for training the transformer model.

Parameters:
  • src (tensor) – The sequence to the encoder (required).

  • tgt (tensor) – The sequence to the decoder (required).

  • pad_idx (int) – The index for <pad> token (default=0).

Returns:

  • src_key_padding_mask (torch.Tensor) – the source key padding mask

  • tgt_key_padding_mask (torch.Tensor) – the target key padding mask

  • src_mask (torch.Tensor) – the source mask

  • tgt_mask (torch.Tensor) – the target mask
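The four masks listed above can be sketched in plain PyTorch. This is an illustration under stated assumptions, not the library code: it assumes src and tgt are 2-D tensors of token indices, ignores src_len, leaves src_mask unset, and uses an additive look-ahead mask for the target. The name make_masks_sketch is hypothetical.

```python
import torch


def make_masks_sketch(src, tgt, pad_idx=0):
    """Hypothetical stand-in for make_masks on index tensors."""
    # True marks padded positions that attention should ignore.
    src_key_padding_mask = src == pad_idx
    tgt_key_padding_mask = tgt == pad_idx
    # Assumption: no extra restriction on encoder self-attention.
    src_mask = None
    # Additive look-ahead mask: -inf above the diagonal, 0 elsewhere,
    # so position t cannot attend to positions after t.
    tgt_len = tgt.shape[1]
    tgt_mask = torch.triu(
        torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1
    )
    return src_key_padding_mask, tgt_key_padding_mask, src_mask, tgt_mask


src = torch.tensor([[5, 7, 9, 0], [3, 4, 0, 0]])
tgt = torch.tensor([[1, 8, 2], [1, 6, 0]])
masks = make_masks_sketch(src, tgt)
print(masks[0])
print(masks[3])
```

The padding masks depend on the batch contents, while the look-ahead mask depends only on the target length.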

decode(tgt, encoder_out)[source]

This method implements a decoding step for the transformer model.

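Parameters:
  • tgt (torch.Tensor) – The sequence to the decoder.

  • encoder_out (torch.Tensor) – The output of the encoder.

Returns: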

  • prediction (torch.Tensor) – the predicted sequence

  • attention (torch.Tensor) – the attention matrix corresponding to the last attention head (useful for plotting attention)

training: bool

speechbrain.lobes.models.g2p.model.input_dim(use_word_emb, embedding_dim, word_emb_enc_dim)[source]

Computes the input dimension (intended for hparam files)

Parameters:
  • use_word_emb (bool) – whether to use word embeddings

  • embedding_dim (int) – the embedding dimension

  • word_emb_enc_dim (int) – the dimension of encoded word embeddings

Returns:

input_dim – the input dimension

Return type:

int
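A minimal sketch of the computation, assuming the encoded word embedding is concatenated to the character embedding along the feature dimension, so that its size is simply added when word embeddings are enabled. The name input_dim_sketch is hypothetical and the sizes are illustrative.

```python
def input_dim_sketch(use_word_emb, embedding_dim, word_emb_enc_dim):
    """Hypothetical stand-in: total feature size fed to the encoder."""
    # Assumption: word embeddings, when used, are concatenated to the
    # character embeddings, so the two dimensions add up.
    if use_word_emb:
        return embedding_dim + word_emb_enc_dim
    return embedding_dim


with_words = input_dim_sketch(True, 512, 256)
without_words = input_dim_sketch(False, 512, 256)
print(with_words, without_words)
```

Computing this value in one place lets an hparam file size the first encoder layer consistently whether or not word embeddings are switched on.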

speechbrain.lobes.models.g2p.model.get_dummy_phonemes(batch_size, device)[source]

Creates a dummy phoneme sequence

Parameters:
  • batch_size (int) – the batch size

  • device (str) – the target device

Returns:

result

Return type:

torch.Tensor
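The reference above does not describe the contents of the result. The sketch below assumes one plausible shape: a single Beginning-of-Sentence index per batch item, using index 0 to match the bos_token default documented for AttentionSeq2Seq. The name get_dummy_phonemes_sketch and the bos_index parameter are hypothetical.

```python
import torch


def get_dummy_phonemes_sketch(batch_size, device, bos_index=0):
    """Hypothetical stand-in: a (batch_size x 1) tensor of BOS indices."""
    # Assumption: the dummy sequence is one BOS token per batch item.
    return torch.full(
        (batch_size, 1), bos_index, dtype=torch.long, device=device
    )


dummy = get_dummy_phonemes_sketch(batch_size=4, device="cpu")
print(dummy.shape)
```

A placeholder like this is useful at inference time, when forward() still expects a phoneme input but no ground-truth phonemes exist.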