speechbrain.lobes.models.MSTacotron2 module

Neural network modules for the Zero-Shot Multi-Speaker Tacotron2 end-to-end neural Text-to-Speech (TTS) model

Authors
  • Georges Abous-Rjeili 2021
  • Artem Ploujnikov 2021
  • Pradnya Kandarkar 2023

Summary

Classes:

Loss

The Tacotron loss implementation. The loss consists of an MSE loss on the spectrogram, a BCE gate loss, and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal. The output of the module is a LossStats tuple, which includes both the total loss and the individual losses.

LossStats

alias of TacotronLoss

Tacotron2

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

TextMelCollate

Zero-pads model inputs and targets based on the number of frames per step.

Reference

class speechbrain.lobes.models.MSTacotron2.Tacotron2(spk_emb_size, mask_padding=True, n_mel_channels=80, n_symbols=148, symbols_embedding_dim=512, encoder_kernel_size=5, encoder_n_convolutions=3, encoder_embedding_dim=512, attention_rnn_dim=1024, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, n_frames_per_step=1, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, decoder_no_early_stopping=False)[source]

Bases: Module

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

This class is the main entry point for the model. It is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified structure: phoneme input -> token embedding -> encoder -> (encoder output + speaker embedding) -> attention -> decoder (+ prenet) -> postnet -> output

The prenet takes the decoder output from the previous time step as input; its output, concatenated with the attention context, forms the input to the decoder, as sketched below.
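To make the decoder-step wiring above concrete, here is a minimal sketch with assumed dimensions (an illustration, not the module's internal code):

import torch
import torch.nn as nn

# Hypothetical dimensions, chosen to match the defaults documented below
batch, n_mel, prenet_dim, enc_dim, rnn_dim = 2, 80, 256, 512, 1024
prenet = nn.Sequential(nn.Linear(n_mel, prenet_dim), nn.ReLU())
decoder_rnn = nn.LSTMCell(prenet_dim + enc_dim, rnn_dim)

prev_frame = torch.zeros(batch, n_mel)       # decoder output at the previous step
attn_context = torch.zeros(batch, enc_dim)   # attention-weighted encoder output
rnn_in = torch.cat([prenet(prev_frame), attn_context], dim=-1)
h, c = decoder_rnn(rnn_in)                   # one decoder step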

Parameters:
  • spk_emb_size (int) – Speaker embedding size

  • mask_padding (bool) – whether or not to mask the padded outputs of Tacotron

  • n_mel_channels (int) – number of mel channels for constructing the spectrogram

  • n_symbols (int) – number of accepted character symbols defined in textToSequence

  • symbols_embedding_dim (int) – number of embedding dimensions for symbols fed to nn.Embedding

  • encoder_kernel_size (int) – size of the kernel processing the embeddings

  • encoder_n_convolutions (int) – number of convolution layers in the encoder

  • encoder_embedding_dim (int) – number of kernels in the encoder; this is also the dimension of the bidirectional LSTM in the encoder

  • attention_rnn_dim (int) – input dimension of the attention RNN

  • attention_dim (int) – number of hidden representations in the attention

  • attention_location_n_filters (int) – number of 1-D convolution filters in the location-sensitive attention

  • attention_location_kernel_size (int) – length of the 1-D convolution filters

  • n_frames_per_step (int) – number of mel frames generated per decoder step; only 1 frame per step is supported as of now

  • decoder_rnn_dim (int) – number of units in the two unidirectional stacked LSTM layers of the decoder

  • prenet_dim (int) – dimension of the linear prenet layers

  • max_decoder_steps (int) – maximum number of steps/frames the decoder generates before stopping

  • p_attention_dropout (float) – attention dropout probability

  • p_decoder_dropout (float) – decoder dropout probability

  • gate_threshold (float) – cut-off level; any output probability above it is considered complete and stops generation, so outputs have variable length

  • decoder_no_early_stopping (bool) – determines early stopping of the decoder along with gate_threshold; the logical inverse of this is fed to the decoder

  • postnet_embedding_dim (int) – number of postnet filters

  • postnet_kernel_size (int) – 1-D size of the postnet kernel

  • postnet_n_convolutions (int) – number of convolution layers in the postnet

Example

>>> import torch
>>> _ = torch.manual_seed(213312)
>>> from speechbrain.lobes.models.MSTacotron2 import Tacotron2
>>> model = Tacotron2(
...    spk_emb_size=192,  # speaker embedding size (192 assumed here)
...    mask_padding=True,
...    n_mel_channels=80,
...    n_symbols=148,
...    symbols_embedding_dim=512,
...    encoder_kernel_size=5,
...    encoder_n_convolutions=3,
...    encoder_embedding_dim=512,
...    attention_rnn_dim=1024,
...    attention_dim=128,
...    attention_location_n_filters=32,
...    attention_location_kernel_size=31,
...    n_frames_per_step=1,
...    decoder_rnn_dim=1024,
...    prenet_dim=256,
...    max_decoder_steps=32,
...    gate_threshold=0.5,
...    p_attention_dropout=0.1,
...    p_decoder_dropout=0.1,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    decoder_no_early_stopping=False
... )
>>> _ = model.eval()
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> spk_embs = torch.randn(2, 192)
>>> outputs, output_lengths, alignments = model.infer(inputs, spk_embs, input_lengths)
>>> outputs.shape, output_lengths.shape, alignments.shape
(torch.Size([2, 80, 1]), torch.Size([2]), torch.Size([2, 1, 5]))
parse_output(outputs, output_lengths, alignments_dim=None)[source]

Masks the padded part of output

Parameters:
  • outputs (list) – a list of tensors - raw outputs

  • output_lengths (torch.Tensor) – a tensor representing the lengths of all outputs

  • alignments_dim (int) – the desired dimension of the alignments along the last axis; optional, but needed for data-parallel training

Returns:

result – a (mel_outputs, mel_outputs_postnet, gate_outputs, alignments) tuple of the original outputs, with the mask applied

Return type:

tuple
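For intuition, the masking that parse_output performs can be sketched from a lengths tensor alone (a minimal illustration under assumed shapes, not the module's exact code):

import torch

outputs = torch.randn(2, 80, 6)       # (batch, n_mel, max_len) raw decoder output
lengths = torch.tensor([6, 4])        # true lengths per batch item
mask = torch.arange(6)[None, :] >= lengths[:, None]   # True on padded frames
masked = outputs.masked_fill(mask[:, None, :], 0.0)   # zero out the padding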

forward(inputs, spk_embs, alignments_dim=None)[source]

Decoder forward pass for training

Parameters:
  • inputs (tuple) – batch object

  • spk_embs (torch.Tensor) – Speaker embeddings corresponding to the inputs

  • alignments_dim (int) – the desired dimension of the alignments along the last axis; optional, but needed for data-parallel training

Returns:

  • mel_outputs (torch.Tensor) – mel outputs from the decoder

  • mel_outputs_postnet (torch.Tensor) – mel outputs from postnet

  • gate_outputs (torch.Tensor) – gate outputs from the decoder

  • alignments (torch.Tensor) – sequence of attention weights from the decoder

  • output_lengths (torch.Tensor) – lengths of the outputs without padding

infer(inputs, spk_embs, input_lengths)[source]

Produces outputs

Parameters:
  • inputs (torch.Tensor) – text or phonemes converted to token index sequences

  • spk_embs (torch.Tensor) – Speaker embeddings corresponding to the inputs

  • input_lengths (torch.Tensor) – the lengths of the input sequences

Returns:

  • mel_outputs_postnet (torch.Tensor) – final mel output of Tacotron2

  • mel_lengths (torch.Tensor) – length of mels

  • alignments (torch.Tensor) – sequence of attention weights
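A hedged end-to-end usage sketch (spk_emb_size=192 is an assumed speaker-embedding dimension, e.g. from an ECAPA-style encoder; it is not mandated by this module):

import torch
from speechbrain.lobes.models.MSTacotron2 import Tacotron2

model = Tacotron2(spk_emb_size=192, max_decoder_steps=32).eval()
inputs = torch.tensor([[13, 12, 31, 14, 19], [31, 16, 30, 31, 0]])
input_lengths = torch.tensor([5, 4])
spk_embs = torch.randn(2, 192)        # one (assumed 192-dim) embedding per utterance
mel, mel_lengths, alignments = model.infer(inputs, spk_embs, input_lengths)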

training: bool
speechbrain.lobes.models.MSTacotron2.LossStats

alias of TacotronLoss

class speechbrain.lobes.models.MSTacotron2.Loss(guided_attention_sigma=None, gate_loss_weight=1.0, mel_loss_weight=1.0, spk_emb_loss_weight=1.0, spk_emb_loss_type=None, guided_attention_weight=1.0, guided_attention_scheduler=None, guided_attention_hard_stop=None)[source]

Bases: Module

The Tacotron loss implementation. The loss consists of an MSE loss on the spectrogram, a BCE gate loss, and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal. The output of the module is a LossStats tuple, which includes both the total loss and the individual losses.

Parameters:
  • guided_attention_sigma (float) – The guided attention sigma factor, controlling the “width” of the mask

  • gate_loss_weight (float) – The constant by which the gate loss will be multiplied

  • mel_loss_weight (float) – The constant by which the mel loss will be multiplied

  • spk_emb_loss_weight (float) – The constant by which the speaker embedding loss will be multiplied - placeholder for future work

  • spk_emb_loss_type (str) – Type of the speaker embedding loss - placeholder for future work

  • guided_attention_weight (float) – The weight for the guided attention

  • guided_attention_scheduler (callable) – The scheduler class for the guided attention loss

  • guided_attention_hard_stop (int) – The number of epochs after which guided attention will be completely turned off

Example

>>> import torch
>>> _ = torch.manual_seed(42)
>>> from speechbrain.lobes.models.MSTacotron2 import Loss
>>> loss = Loss(guided_attention_sigma=0.2)
>>> mel_target = torch.randn(2, 80, 861)
>>> gate_target = torch.randn(1722, 1)
>>> mel_out = torch.randn(2, 80, 861)
>>> mel_out_postnet = torch.randn(2, 80, 861)
>>> gate_out = torch.randn(2, 861)
>>> alignments = torch.randn(2, 861, 173)
>>> pred_mel_lens = torch.randn(2)
>>> targets = mel_target, gate_target
>>> model_outputs = mel_out, mel_out_postnet, gate_out, alignments, pred_mel_lens
>>> input_lengths = torch.tensor([173, 91])
>>> target_lengths = torch.tensor([861, 438])
>>> spk_embs = None
>>> loss(model_outputs, targets, input_lengths, target_lengths, spk_embs, 1)
TacotronLoss(loss=tensor([4.8566]), mel_loss=tensor(4.0097), spk_emb_loss=tensor([0.]), gate_loss=tensor(0.8460), attn_loss=tensor(0.0010), attn_weight=tensor(1.))
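Since LossStats (TacotronLoss) is a named tuple, the individual terms can be read by name; continuing the example above (a usage sketch):

>>> stats = loss(model_outputs, targets, input_lengths, target_lengths, spk_embs, 1)
>>> total, gate = stats.loss, stats.gate_loss  # total drives backward(); parts are logged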

forward(model_output, targets, input_lengths, target_lengths, spk_embs, epoch)[source]

Computes the loss

Parameters:
  • model_output (tuple) – the output of the model’s forward(): (mel_outputs, mel_outputs_postnet, gate_outputs, alignments)

  • targets (tuple) – the targets

  • input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths

  • target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths

  • spk_embs (torch.Tensor) – Speaker embedding input for the loss computation - placeholder for future work

  • epoch (int) – the current epoch number (used for the scheduling of the guided attention loss); a StepScheduler is typically used

Returns:

result – the total loss and the individual losses (mel and gate)

Return type:

LossStats

get_attention_loss(alignments, input_lengths, target_lengths, epoch)[source]

Computes the attention loss

Parameters:
  • alignments (torch.Tensor) – the alignment matrix from the model

  • input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths

  • target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths

  • epoch (int) – the current epoch number (used for the scheduling of the guided attention loss); a StepScheduler is typically used

Returns:

attn_loss – the attention loss value

Return type:

torch.Tensor
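For intuition, guided attention penalizes attention mass far from the diagonal. A minimal sketch of such a soft mask, following the usual formulation W[t, n] = 1 - exp(-((n/N - t/T)^2) / (2*sigma^2)) (an illustration, not necessarily this module's exact implementation):

import torch

T, N, sigma = 861, 173, 0.2           # decoder steps, encoder steps, mask "width"
t = torch.arange(T)[:, None] / T
n = torch.arange(N)[None, :] / N
W = 1.0 - torch.exp(-((n - t) ** 2) / (2 * sigma ** 2))  # (T, N), ~0 near the diagonal
# Multiplying W with the alignment matrix and averaging gives the penalty.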

training: bool
class speechbrain.lobes.models.MSTacotron2.TextMelCollate(speaker_embeddings_pickle, n_frames_per_step=1)[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step

Parameters:
  • speaker_embeddings_pickle (str) – Path to the file containing speaker embeddings

  • n_frames_per_step (int) – The number of output frames per step

Returns:

result – a tuple of inputs/targets: (text_padded, input_lengths, mel_padded, gate_padded, output_lengths, len_x, labels, wavs, spk_embs, spk_ids)

Return type:

tuple
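A hedged usage sketch: an instance is intended to serve as a DataLoader collate_fn (the pickle path and train_dataset below are hypothetical placeholders, not real artifacts):

from torch.utils.data import DataLoader
from speechbrain.lobes.models.MSTacotron2 import TextMelCollate

collate_fn = TextMelCollate("spk_embs.pickle", n_frames_per_step=1)  # placeholder path
# train_dataset: a hypothetical dataset yielding [text_normalized, mel_normalized, ...] items
loader = DataLoader(train_dataset, batch_size=16, collate_fn=collate_fn)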

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrograms

Parameters:
  • batch (list) – [text_normalized, mel_normalized]