speechbrain.lobes.models.MSTacotron2 module

Neural network modules for the Zero-Shot Multi-Speaker Tacotron2 end-to-end neural Text-to-Speech (TTS) model

Authors
  • Georges Abous-Rjeili 2021
  • Artem Ploujnikov 2021
  • Pradnya Kandarkar 2023

Summary

Classes:

Loss

The Tacotron loss implementation. The loss consists of an MSE loss on the spectrogram, a BCE gate loss, and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal. The output of the module is a LossStats tuple, which includes both the total loss and the individual loss components.

LossStats

alias of TacotronLoss

Tacotron2

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

TextMelCollate

Zero-pads model inputs and targets based on the number of frames per step

Reference

class speechbrain.lobes.models.MSTacotron2.Tacotron2(spk_emb_size, mask_padding=True, n_mel_channels=80, n_symbols=148, symbols_embedding_dim=512, encoder_kernel_size=5, encoder_n_convolutions=3, encoder_embedding_dim=512, attention_rnn_dim=1024, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, n_frames_per_step=1, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, decoder_no_early_stopping=False)[source]

Bases: Module

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

This class is the main entry point for the model. It is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified structure: phoneme input -> token embedding -> encoder -> (encoder output + speaker embedding) -> attention -> decoder (+ prenet) -> postnet -> output

The prenet takes the decoder output from the previous time step as input; its output is concatenated with the attention context and fed to the decoder.
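
To make the prenet note concrete, here is a toy illustration of that concatenation (a sketch using the default dimensions listed below, not the module's actual code):

>>> import torch
>>> prenet_out = torch.randn(2, 256)    # (batch, prenet_dim)
>>> attn_context = torch.randn(2, 512)  # (batch, encoder_embedding_dim)
>>> decoder_input = torch.cat((prenet_out, attn_context), dim=-1)
>>> decoder_input.shape
torch.Size([2, 768])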

Parameters:
  • spk_emb_size (int) – Speaker embedding size

  • mask_padding (bool) – whether or not to mask the padded outputs of Tacotron

  • n_mel_channels (int) – number of mel channels for constructing spectrogram

  • n_symbols (int) – number of accepted character symbols defined in textToSequence

  • symbols_embedding_dim (int) – embedding dimension for the symbols fed to nn.Embedding

  • encoder_kernel_size (int) – size of kernel processing the embeddings

  • encoder_n_convolutions (int) – number of convolution layers in encoder

  • encoder_embedding_dim (int) – number of kernels in the encoder; this is also the dimension of the bidirectional LSTM in the encoder

  • attention_rnn_dim (int) – number of hidden units in the attention RNN

  • attention_dim (int) – dimension of the hidden representation in the attention mechanism

  • attention_location_n_filters (int) – number of 1-D convolution filters in attention

  • attention_location_kernel_size (int) – length of the 1-D convolution filters

  • n_frames_per_step (int=1) – number of mel frames generated per decoder step; currently only 1 frame per step is supported

  • decoder_rnn_dim (int) – number of units in the two stacked unidirectional decoder LSTM layers

  • prenet_dim (int) – dimension of linear prenet layers

  • max_decoder_steps (int) – maximum number of steps/frames the decoder generates before stopping

  • gate_threshold (float) – cut-off level: any gate output probability above this value is considered complete and stops generation, yielding variable-length outputs (see the sketch after this parameter list)

  • p_attention_dropout (float) – attention drop out probability

  • p_decoder_dropout (float) – decoder drop out probability

  • postnet_embedding_dim (int) – number of postnet filters

  • postnet_kernel_size (int) – 1-D size of the postnet kernel

  • postnet_n_convolutions (int) – number of convolution layers in postnet

  • decoder_no_early_stopping (bool) – determines early stopping of the decoder along with gate_threshold; the logical inverse of this flag is fed to the decoder

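For intuition on gate_threshold: at each step the decoder emits a gate logit, and generation stops for a sequence once the sigmoid of that logit exceeds the threshold. A minimal sketch of the stopping rule (not the module's exact code):

>>> import torch
>>> gate_out = torch.tensor([-1.0, 2.5])  # raw gate logits for a batch of 2
>>> torch.sigmoid(gate_out) > 0.5         # True marks sequences that stop
tensor([False,  True])
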
Example

>>> import torch
>>> _ = torch.manual_seed(213312)
>>> from speechbrain.lobes.models.MSTacotron2 import Tacotron2
>>> model = Tacotron2(
...    spk_emb_size=128,
...    mask_padding=True,
...    n_mel_channels=80,
...    n_symbols=148,
...    symbols_embedding_dim=512,
...    encoder_kernel_size=5,
...    encoder_n_convolutions=3,
...    encoder_embedding_dim=512,
...    attention_rnn_dim=1024,
...    attention_dim=128,
...    attention_location_n_filters=32,
...    attention_location_kernel_size=31,
...    n_frames_per_step=1,
...    decoder_rnn_dim=1024,
...    prenet_dim=256,
...    max_decoder_steps=32,
...    gate_threshold=0.5,
...    p_attention_dropout=0.1,
...    p_decoder_dropout=0.1,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    decoder_no_early_stopping=False
... )
>>> _ = model.eval()
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> spk_embs = torch.randn(2, 128)
>>> outputs, output_lengths, alignments = model.infer(inputs, spk_embs, input_lengths)
>>> outputs.shape, output_lengths.shape, alignments.shape
(torch.Size([2, 80, 1]), torch.Size([2]), torch.Size([2, 1, 5]))
parse_output(outputs, output_lengths, alignments_dim=None)[source]

Masks the padded part of the output

Parameters:
  • outputs (list) – a list of tensors: the raw outputs

  • output_lengths (torch.Tensor) – a tensor representing the lengths of all outputs

  • alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training

Returns:

  • mel_outputs (torch.Tensor)

  • mel_outputs_postnet (torch.Tensor)

  • gate_outputs (torch.Tensor)

  • alignments (torch.Tensor)

  • output_lengths (torch.Tensor) – the original outputs, with the mask applied
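
In a toy form, masking the padded part of an output amounts to zeroing frames beyond each item's true length (a sketch under that assumption, not the actual implementation):

>>> import torch
>>> mel = torch.ones(2, 3, 5)  # (batch, n_mel_channels, time)
>>> lengths = torch.tensor([5, 3])
>>> mask = torch.arange(5)[None, None, :] < lengths[:, None, None]
>>> (mel * mask)[1, 0]         # second item: frames past length 3 are zeroed
tensor([1., 1., 1., 0., 0.])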

forward(inputs, spk_embs, alignments_dim=None)[source]

Decoder forward pass for training

Parameters:
  • inputs (tuple) – batch object (see the sketch below)

  • spk_embs (torch.Tensor) – Speaker embeddings corresponding to the inputs

  • alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training

Returns:

  • mel_outputs (torch.Tensor) – mel outputs from the decoder

  • mel_outputs_postnet (torch.Tensor) – mel outputs from postnet

  • gate_outputs (torch.Tensor) – gate outputs from the decoder

  • alignments (torch.Tensor) – sequence of attention weights from the decoder

  • output_lengths (torch.Tensor) – length of the output without padding
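
The layout of the batch tuple is not spelled out above; assuming it follows the single-speaker Tacotron2 convention of (text_padded, input_lengths, mel_padded, max_len, output_lengths), a training-time call might look like this sketch (not runnable as-is):

>>> batch = (text_padded, input_lengths, mel_padded, max_len, output_lengths)  # doctest: +SKIP
>>> spk_embs = torch.randn(batch_size, spk_emb_size)  # doctest: +SKIP
>>> mel_out, mel_out_postnet, gate_out, alignments, out_lens = model(
...     batch, spk_embs
... )  # doctest: +SKIP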

infer(inputs, spk_embs, input_lengths)[source]

Produces outputs

Parameters:
  • inputs (torch.Tensor) – text or phonemes converted to token indices

  • spk_embs (torch.Tensor) – Speaker embeddings corresponding to the inputs

  • input_lengths (torch.Tensor) – the lengths of the input sequences

Returns:

  • mel_outputs_postnet (torch.Tensor) – final mel output of tacotron 2

  • mel_lengths (torch.Tensor) – length of mels

  • alignments (torch.Tensor) – sequence of attention weights

speechbrain.lobes.models.MSTacotron2.LossStats

alias of TacotronLoss

class speechbrain.lobes.models.MSTacotron2.Loss(guided_attention_sigma=None, gate_loss_weight=1.0, mel_loss_weight=1.0, spk_emb_loss_weight=1.0, spk_emb_loss_type=None, guided_attention_weight=1.0, guided_attention_scheduler=None, guided_attention_hard_stop=None)[source]

Bases: Module

The Tacotron loss implementation. The loss consists of an MSE loss on the spectrogram, a BCE gate loss, and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal. The output of the module is a LossStats tuple, which includes both the total loss and the individual loss components.

Parameters:
  • guided_attention_sigma (float) – The guided attention sigma factor, controlling the "width" of the mask

  • gate_loss_weight (float) – The constant by which the gate loss will be multiplied

  • mel_loss_weight (float) – The constant by which the mel loss will be multiplied

  • spk_emb_loss_weight (float) – The constant by which the speaker embedding loss will be multiplied - placeholder for future work

  • spk_emb_loss_type (str) – Type of the speaker embedding loss - placeholder for future work

  • guided_attention_weight (float) – The weight for the guided attention

  • guided_attention_scheduler (callable) – The scheduler class for the guided attention loss

  • guided_attention_hard_stop (int) – The number of epochs after which guided attention will be completely turned off

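The guided attention term penalizes attention mass far from the text/spectrogram diagonal. As a rough illustration of the idea (a soft diagonal penalty in the style of Tachibana et al., 2017; not necessarily this module's exact formulation):

>>> import torch
>>> def diagonal_penalty(n_dec, n_enc, sigma=0.2):
...     dec = torch.arange(n_dec).unsqueeze(1) / n_dec
...     enc = torch.arange(n_enc).unsqueeze(0) / n_enc
...     return 1.0 - torch.exp(-((dec - enc) ** 2) / (2 * sigma**2))
>>> penalty = diagonal_penalty(861, 173)
>>> penalty.shape  # multiplied with the alignments and averaged to get a loss
torch.Size([861, 173])
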
Example

>>> import torch
>>> _ = torch.manual_seed(42)
>>> from speechbrain.lobes.models.MSTacotron2 import Loss
>>> loss = Loss(guided_attention_sigma=0.2)
>>> mel_target = torch.randn(2, 80, 861)
>>> gate_target = torch.randn(1722, 1)
>>> mel_out = torch.randn(2, 80, 861)
>>> mel_out_postnet = torch.randn(2, 80, 861)
>>> gate_out = torch.randn(2, 861)
>>> alignments = torch.randn(2, 861, 173)
>>> pred_mel_lens = torch.randn(2)
>>> targets = mel_target, gate_target
>>> model_outputs = mel_out, mel_out_postnet, gate_out, alignments, pred_mel_lens
>>> input_lengths = torch.tensor([173,  91])
>>> target_lengths = torch.tensor([861, 438])
>>> spk_embs = None
>>> loss(model_outputs, targets, input_lengths, target_lengths, spk_embs, 1)
TacotronLoss(loss=tensor([4.8566]), mel_loss=tensor(4.0097), spk_emb_loss=tensor([0.]), gate_loss=tensor(0.8460), attn_loss=tensor(0.0010), attn_weight=tensor(1.))
forward(model_output, targets, input_lengths, target_lengths, spk_embs, epoch)[source]

Computes the loss

Parameters:
  • model_output (tuple) – the output of the model's forward(): (mel_outputs, mel_outputs_postnet, gate_outputs, alignments)

  • targets (tuple) – the targets

  • input_lengths (torch.Tensor) – a (batch,) tensor of input lengths

  • target_lengths (torch.Tensor) – a (batch,) tensor of target (spectrogram) lengths

  • spk_embs (torch.Tensor) – Speaker embedding input for the loss computation - placeholder for future work

  • epoch (int) – the current epoch number, used for the scheduling of the guided attention loss. A StepScheduler is typically used.

Returns:

result – the total loss and the individual losses (mel and gate)

Return type:

LossStats

get_attention_loss(alignments, input_lengths, target_lengths, epoch)[source]

Computes the attention loss

Parameters:
  • alignments (torch.Tensor) – the alignment matrix from the model

  • input_lengths (torch.Tensor) – a (batch,) tensor of input lengths

  • target_lengths (torch.Tensor) – a (batch,) tensor of target (spectrogram) lengths

  • epoch (int) – the current epoch number, used for the scheduling of the guided attention loss. A StepScheduler is typically used.

Returns:

attn_loss – the attention loss value

Return type:

torch.Tensor

class speechbrain.lobes.models.MSTacotron2.TextMelCollate(speaker_embeddings_pickle, n_frames_per_step=1)[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step

Parameters:
  • speaker_embeddings_pickle (str) – Path to the file containing speaker embeddings

  • n_frames_per_step (int) – The number of output frames per step

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrograms

Parameters:

batch (list) – [text_normalized, mel_normalized]

Returns:

  • text_padded (torch.Tensor)

  • input_lengths (torch.Tensor)

  • mel_padded (torch.Tensor)

  • gate_padded (torch.Tensor)

  • output_lengths (torch.Tensor)

  • len_x (torch.Tensor)

  • labels (torch.Tensor)

  • wavs (torch.Tensor)

  • spk_embs (torch.Tensor)

  • spk_ids (torch.Tensor)
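
In practice the collate object is handed to a PyTorch DataLoader. A minimal sketch, assuming a dataset whose items match what __call__ expects; the path speaker_embeddings.pkl and the dataset variable are hypothetical:

>>> from torch.utils.data import DataLoader  # doctest: +SKIP
>>> collate_fn = TextMelCollate("speaker_embeddings.pkl", n_frames_per_step=1)  # doctest: +SKIP
>>> loader = DataLoader(dataset, batch_size=16, collate_fn=collate_fn)  # doctest: +SKIP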