speechbrain.lobes.models.g2p.model module

The Attentional RNN model for Grapheme-to-Phoneme

Authors

Mirco Ravanelli 2021
Artem Ploujnikov 2021

Summary

Classes:

`AttentionSeq2Seq`	The Attentional RNN encoder-decoder model
`TransformerG2P`	A Transformer-based Grapheme-to-Phoneme model
`WordEmbeddingEncoder`	A small encoder module that reduces the dimensionality and normalizes word embeddings

Functions:

`get_dummy_phonemes`	Creates a dummy phoneme sequence
`input_dim`	Computes the input dimension (intended for hparam files)

Reference

class speechbrain.lobes.models.g2p.model.AttentionSeq2Seq(enc, encoder_emb, emb, dec, lin, out, bos_token=0, use_word_emb=False, word_emb_enc=None)

Bases: Module

The Attentional RNN encoder-decoder model

Parameters:

enc (torch.nn.Module) – the encoder module
encoder_emb (torch.nn.Module) – the encoder embedding module
emb (torch.nn.Module) – the embedding module
dec (torch.nn.Module) – the decoder module
lin (torch.nn.Module) – the linear module
out (torch.nn.Module) – the output layer (typically log_softmax)
use_word_emb (bool) – whether or not to use word embedding
bos_token (int) – the index of the Beginning-of-Sentence token
word_emb_enc (nn.Module) – a module to encode word embeddings

Returns:

result – a (p_seq, char_lens, encoder_out) tuple (see forward)

Return type:

tuple

forward(grapheme_encoded, phn_encoded=None, word_emb=None, **kwargs)

Computes the forward pass

Parameters:

grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor
phn_encoded (torch.Tensor) – the encoded phonemes
word_emb (torch.Tensor) – word embeddings (optional)

Returns:

p_seq (torch.Tensor) – a (batch x position x token) tensor of token probabilities in each position
char_lens (torch.Tensor) – a tensor of character sequence lengths
encoder_out – the raw output of the encoder
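As an illustration of how the `p_seq` output might be consumed, the sketch below picks the highest-scoring token at each position. Nested Python lists stand in for the (batch x position x token) tensor so that the example runs without PyTorch. The helper name `greedy_tokens` and the toy scores are hypothetical and not part of the library.

```python
from typing import List


def greedy_tokens(p_seq: List[List[List[float]]]) -> List[List[int]]:
    """Return the index of the best-scoring token at every position.

    p_seq is laid out as (batch x position x token); the values may be
    probabilities or log-probabilities, since argmax is the same for both.
    """
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in sequence]
        for sequence in p_seq
    ]


# Toy scores: 2 sequences, 2 positions each, vocabulary of 3 tokens
toy_p_seq = [
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
    [[0.2, 0.2, 0.6], [0.05, 0.9, 0.05]],
]
print(greedy_tokens(toy_p_seq))  # [[1, 0], [2, 1]]
```

With real tensors, the same selection is usually an argmax over the last dimension.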

training: bool

class speechbrain.lobes.models.g2p.model.WordEmbeddingEncoder(word_emb_dim, word_emb_enc_dim, norm=None, norm_type=None)

Bases: Module

A small encoder module that reduces the dimensionality and normalizes word embeddings

Parameters:

word_emb_dim (int) – the dimension of the original word embeddings
word_emb_enc_dim (int) – the dimension of the encoded word embeddings
norm (torch.nn.Module) – the normalization to be used (e.g. speechbrain.nnet.normalization.LayerNorm)
norm_type (str) – the type of normalization to be used, given as a key of NORMS ('batch', 'instance' or 'layer')

forward(emb)

Computes the forward pass of the embedding

Parameters:

emb (torch.Tensor) – the original word embeddings

Returns:

emb_enc – encoded word embeddings

Return type:

torch.Tensor

NORMS = {'batch': <class 'speechbrain.nnet.normalization.BatchNorm1d'>, 'instance': <class 'speechbrain.nnet.normalization.InstanceNorm1d'>, 'layer': <class 'speechbrain.nnet.normalization.LayerNorm'>}
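The `NORMS` mapping suggests that `norm_type` is a string key that resolves to a normalization class. The sketch below shows one way such a lookup could work. The stub classes and the `resolve_norm` helper are hypothetical, and raising an error on an unknown key is an assumption, not confirmed library behaviour.

```python
class NormStub:
    """Placeholder for a normalization layer (not a real SpeechBrain class)."""

    def __init__(self, input_size: int):
        self.input_size = input_size


class BatchNormStub(NormStub):
    pass


class InstanceNormStub(NormStub):
    pass


class LayerNormStub(NormStub):
    pass


# Mirrors the keys of NORMS; the values are stand-ins for the real classes
NORMS_SKETCH = {
    "batch": BatchNormStub,
    "instance": InstanceNormStub,
    "layer": LayerNormStub,
}


def resolve_norm(norm_type: str, dim: int) -> NormStub:
    """Look up a normalization class by key and instantiate it for `dim` features."""
    norm_cls = NORMS_SKETCH.get(norm_type)
    if norm_cls is None:
        # Assumption: unknown keys are rejected rather than silently ignored
        raise ValueError(f"Unknown norm_type: {norm_type!r}")
    return norm_cls(dim)


layer = resolve_norm("layer", 256)
print(type(layer).__name__, layer.input_size)
```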

training: bool

class speechbrain.lobes.models.g2p.model.TransformerG2P(emb, encoder_emb, char_lin, phn_lin, lin, out, d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6, d_ffn=2048, dropout=0.1, activation=<class 'torch.nn.modules.activation.ReLU'>, custom_src_module=None, custom_tgt_module=None, positional_encoding='fixed_abs_sine', normalize_before=True, kernel_size=15, bias=True, encoder_module='transformer', attention_type='regularMHA', max_length=2500, causal=False, pad_idx=0, encoder_kdim=None, encoder_vdim=None, decoder_kdim=None, decoder_vdim=None, use_word_emb=False, word_emb_enc=None)

Bases: TransformerInterface

A Transformer-based Grapheme-to-Phoneme model

Parameters:

emb (torch.nn.Module) – the embedding module
encoder_emb (torch.nn.Module) – the encoder embedding module
char_lin (torch.nn.Module) – a linear module connecting the inputs to the transformer
phn_lin (torch.nn.Module) – a linear module connecting the outputs to the transformer
out (torch.nn.Module) – the decoder module (usually Softmax)
lin (torch.nn.Module) – the linear module for outputs
d_model (int) – The number of expected features in the encoder/decoder inputs (default=512).
nhead (int) – The number of heads in the multi-head attention models (default=8).
num_encoder_layers (int, optional) – The number of encoder layers in the encoder.
num_decoder_layers (int, optional) – The number of decoder layers in the decoder.
d_ffn (int, optional) – The dimension of the feedforward network model hidden layer.
dropout (float, optional) – The dropout value.
activation (torch.nn.Module, optional) – The activation function for the feed-forward network layer, e.g. relu, gelu, or swish.
custom_src_module (torch.nn.Module, optional) – Module that processes the src features to expected feature dim.
custom_tgt_module (torch.nn.Module, optional) – Module that processes the tgt features to expected feature dim.
positional_encoding (str, optional) – Type of positional encoding used, e.g. ‘fixed_abs_sine’ for fixed absolute positional encodings.
normalize_before (bool, optional) – Whether normalization should be applied before or after MHA or FFN in Transformer layers. Defaults to True as this was shown to lead to better performance and training stability.
kernel_size (int, optional) – Kernel size in convolutional layers when Conformer is used.
bias (bool, optional) – Whether to use bias in Conformer convolutional layers.
encoder_module (str, optional) – Choose between Conformer and Transformer for the encoder. The decoder is fixed to be a Transformer.
conformer_activation (torch.nn.Module, optional) – Activation module used after Conformer convolutional layers, e.g. Swish or ReLU. It must be a torch Module.
attention_type (str, optional) – Type of attention layer used in all Transformer or Conformer layers, e.g. regularMHA or RelPosMHA.
max_length (int, optional) – Max length for the target and source sequence in input. Used for positional encodings.
causal (bool, optional) – Whether the encoder should be causal or not (the decoder is always causal). If causal the Conformer convolutional layer is causal.
pad_idx (int) – the padding index (for masks)
encoder_kdim (int, optional) – Dimension of the key for the encoder.
encoder_vdim (int, optional) – Dimension of the value for the encoder.
decoder_kdim (int, optional) – Dimension of the key for the decoder.
decoder_vdim (int, optional) – Dimension of the value for the decoder.

forward(grapheme_encoded, phn_encoded=None, word_emb=None, **kwargs)

Computes the forward pass

Parameters:

grapheme_encoded (torch.Tensor) – graphemes encoded as a Torch tensor
phn_encoded (torch.Tensor) – the encoded phonemes
word_emb (torch.Tensor) – word embeddings (if applicable)

Returns:

p_seq (torch.Tensor) – the log-probabilities of individual tokens in a sequence
char_lens (torch.Tensor) – the character sequence lengths
encoder_out (torch.Tensor) – the encoder state
attention (torch.Tensor) – the attention state

make_masks(src, tgt, src_len=None, pad_idx=0)

This method generates the masks for training the transformer model.

Parameters:

src (tensor) – The sequence to the encoder (required).
tgt (tensor) – The sequence to the decoder (required).
pad_idx (int) – The index for <pad> token (default=0).

Returns:

src_key_padding_mask (torch.Tensor) – the source key padding mask
tgt_key_padding_mask (torch.Tensor) – the target key padding mask
src_mask (torch.Tensor) – the source mask
tgt_mask (torch.Tensor) – the target mask
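To make the four returned masks concrete, the sketch below builds them for a toy batch using plain Python lists in place of tensors. It assumes that padding masks flag positions equal to `pad_idx`, that the target mask blocks attention to future positions, and that no source mask is needed for a non-causal encoder. The helper `make_masks_sketch` and the boolean layout are illustrative assumptions, not the library implementation.

```python
from typing import List, Optional, Tuple

Mask = List[List[bool]]


def make_masks_sketch(
    src: List[List[int]], tgt: List[List[int]], pad_idx: int = 0
) -> Tuple[Mask, Mask, Optional[Mask], Mask]:
    """Build padding and causal masks; True marks a position to ignore."""
    # Padding masks: one flag per token, set where the token is padding
    src_key_padding_mask = [[tok == pad_idx for tok in seq] for seq in src]
    tgt_key_padding_mask = [[tok == pad_idx for tok in seq] for seq in tgt]

    # Causal (look-ahead) mask: position i may not attend to any j > i
    tgt_len = len(tgt[0]) if tgt else 0
    tgt_mask = [[j > i for j in range(tgt_len)] for i in range(tgt_len)]

    # Assumption: the encoder is non-causal, so no source mask is required
    src_mask = None
    return src_key_padding_mask, tgt_key_padding_mask, src_mask, tgt_mask


src = [[5, 6, 0], [7, 0, 0]]  # grapheme indices, 0 = padding
tgt = [[1, 2], [3, 0]]        # phoneme indices, 0 = padding
src_pad, tgt_pad, src_mask, tgt_mask = make_masks_sketch(src, tgt)
print(src_pad)   # [[False, False, True], [False, True, True]]
print(tgt_mask)  # [[False, True], [False, False]]
```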

decode(tgt, encoder_out)

This method implements a decoding step for the transformer model.

Parameters:

tgt (torch.Tensor) – The sequence to the decoder.
encoder_out (torch.Tensor) – Hidden output of the encoder.

Returns:

prediction (torch.Tensor) – the predicted sequence
attention (torch.Tensor) – the attention matrix corresponding to the last attention head (useful for plotting attention)
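A single decoding step like this is typically called in a loop, feeding the tokens predicted so far back in as `tgt`. The sketch below shows a generic greedy loop of that kind in plain Python. The `step` callable stands in for a decoding step, and `greedy_decode`, `toy_step`, and the BOS/EOS indices are all hypothetical names chosen for illustration; this is not how the library's own searchers are implemented.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step: Callable[[List[int]], Sequence[float]],
    bos: int,
    eos: int,
    max_len: int,
) -> List[int]:
    """Repeatedly pick the best next token until EOS or max_len steps."""
    tokens = [bos]
    for _ in range(max_len):
        scores = step(tokens)  # scores for the next token given the prefix
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


def toy_step(prefix: List[int]) -> List[float]:
    """Toy model over 4 tokens: always predicts (last token + 1), capped at 3."""
    vocab_size = 4
    target = min(prefix[-1] + 1, vocab_size - 1)
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_decode(toy_step, bos=0, eos=3, max_len=10))  # [0, 1, 2, 3]
```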

training: bool

speechbrain.lobes.models.g2p.model.input_dim(use_word_emb, embedding_dim, word_emb_enc_dim)

Computes the input dimension (intended for hparam files)

Parameters:

use_word_emb (bool) – whether to use word embeddings
embedding_dim (int) – the embedding dimension
word_emb_enc_dim (int) – the dimension of encoded word embeddings

Returns:

input_dim – the input dimension

Return type:

int
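As a rough illustration of what such a helper computes, the sketch below assumes that the encoded word-embedding dimension is added to the grapheme embedding dimension when `use_word_emb` is true. That behaviour is inferred from the parameter names. `input_dim_sketch` is a hypothetical stand-in, not the library function.

```python
def input_dim_sketch(
    use_word_emb: bool, embedding_dim: int, word_emb_enc_dim: int
) -> int:
    """Hypothetical stand-in for input_dim (assumed behaviour)."""
    dim = embedding_dim
    if use_word_emb:
        # Assumption: encoded word embeddings are concatenated with the
        # grapheme embeddings, so the two dimensions add up
        dim += word_emb_enc_dim
    return dim


print(input_dim_sketch(False, 512, 256))  # 512
print(input_dim_sketch(True, 512, 256))   # 768
```

Computing the dimension in one place lets a hyperparameter file size the encoder input consistently whether or not word embeddings are enabled.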

speechbrain.lobes.models.g2p.model.get_dummy_phonemes(batch_size, device)

Creates a dummy phoneme sequence

Parameters:

batch_size (int) – the batch size
device (str) – the target device

Returns:

result – the dummy phoneme sequence

Return type:

torch.Tensor