speechbrain.lobes.models.Tacotron2 module

Neural network modules for the Tacotron2 end-to-end neural Text-to-Speech (TTS) model

Authors
 * Georges Abous-Rjeili 2021
 * Artem Ploujnikov 2021

Summary

Classes:

Attention

The Tacotron attention layer.

ConvNorm

A 1D convolution layer with Xavier initialization

Decoder

The Tacotron decoder

Encoder

The Tacotron2 encoder module, consisting of a sequence of 1-d convolution banks (3 by default) and a bidirectional LSTM

LinearNorm

A linear layer with Xavier initialization

LocationLayer

A location-based attention layer consisting of a Xavier-initialized convolutional layer followed by a dense layer

Loss

The Tacotron loss implementation

LossStats

alias of TacotronLoss

Postnet

The Tacotron postnet consists of a number of 1-d convolutional layers with Xavier initialization, tanh activations, and batch normalization.

Prenet

The Tacotron pre-net module consisting of a specified number of normalized (Xavier-initialized) linear layers

Tacotron2

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

TextMelCollate

Zero-pads model inputs and targets based on the number of frames per step

Functions:

dynamic_range_compression

Dynamic range compression for audio signals

get_mask_from_lengths

Creates a mask from a tensor of lengths

infer

An inference hook for pretrained synthesizers

mel_spectogram

Calculates the MelSpectrogram for a raw audio signal

Reference

class speechbrain.lobes.models.Tacotron2.LinearNorm(in_dim, out_dim, bias=True, w_init_gain='linear')[source]

Bases: Module

A linear layer with Xavier initialization

Parameters
  • in_dim (int) – the input dimension

  • out_dim (int) – the output dimension

  • bias (bool) – whether or not to use a bias

  • w_init_gain (str) – the weight initialization gain type (see torch.nn.init.calculate_gain)

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import LinearNorm
>>> layer = LinearNorm(in_dim=5, out_dim=3)
>>> x = torch.randn(3, 5)
>>> y = layer(x)
>>> y.shape
torch.Size([3, 3])
forward(x)[source]

Computes the forward pass

Parameters

x (torch.Tensor) – a (batch, features) input tensor

Returns

output – the linear layer output

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.ConvNorm(in_channels, out_channels, kernel_size=1, stride=1, padding=None, dilation=1, bias=True, w_init_gain='linear')[source]

Bases: Module

A 1D convolution layer with Xavier initialization

Parameters
  • in_channels (int) – the number of input channels

  • out_channels (int) – the number of output channels

  • kernel_size (int) – the kernel size

  • stride (int) – the convolutional stride

  • padding (int) – the amount of padding to include. If not provided, it will be calculated as dilation * (kernel_size - 1) / 2

  • dilation (int) – the dilation of the convolution

  • bias (bool) – whether or not to use a bias

  • w_init_gain (str) – the weight initialization gain type (see torch.nn.init.calculate_gain)

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import ConvNorm
>>> layer = ConvNorm(in_channels=10, out_channels=5, kernel_size=3)
>>> x = torch.randn(3, 10, 5)
>>> y = layer(x)
>>> y.shape
torch.Size([3, 5, 5])
forward(signal)[source]

Computes the forward pass

Parameters

signal (torch.Tensor) – the input to the convolutional layer

Returns

output – the output

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.LocationLayer(attention_n_filters=32, attention_kernel_size=31, attention_dim=128)[source]

Bases: Module

A location-based attention layer consisting of a Xavier-initialized convolutional layer followed by a dense layer

Parameters
  • attention_n_filters (int) – the number of filters used in attention

  • attention_kernel_size (int) – the kernel size of the attention layer

  • attention_dim (int) – the dimension of linear attention layers

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import LocationLayer
>>> layer = LocationLayer()
>>> attention_weights_cat = torch.randn(3, 2, 64)
>>> processed_attention = layer(attention_weights_cat)
>>> processed_attention.shape
torch.Size([3, 64, 128])
forward(attention_weights_cat)[source]

Performs the forward pass for the attention layer

Parameters
  • attention_weights_cat (torch.Tensor) – the concatenated attention weights

Returns

processed_attention – the attention layer output

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.Attention(attention_rnn_dim=1024, embedding_dim=512, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31)[source]

Bases: Module

The Tacotron attention layer. Location-based attention is used.

Parameters
  • attention_rnn_dim (int) – the dimension of the RNN to which the attention layer is applied

  • embedding_dim (int) – the embedding dimension

  • attention_dim (int) – the dimension of the memory cell

  • attention_location_n_filters (int) – the number of location filters

  • attention_location_kernel_size (int) – the kernel size of the location layer

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import (
...     Attention, get_mask_from_lengths)
>>> layer = Attention()
>>> attention_hidden_state = torch.randn(2, 1024)
>>> memory = torch.randn(2, 173, 512)
>>> processed_memory = torch.randn(2, 173, 128)
>>> attention_weights_cat = torch.randn(2, 2, 173)
>>> memory_lengths = torch.tensor([173, 91])
>>> mask = get_mask_from_lengths(memory_lengths)
>>> attention_context, attention_weights = layer(
...    attention_hidden_state,
...    memory,
...    processed_memory,
...    attention_weights_cat,
...    mask
... )
>>> attention_context.shape, attention_weights.shape
(torch.Size([2, 512]), torch.Size([2, 173]))
get_alignment_energies(query, processed_memory, attention_weights_cat)[source]

Computes the alignment energies

Parameters
  • query (torch.Tensor) – decoder output (batch, n_mel_channels * n_frames_per_step)

  • processed_memory (torch.Tensor) – processed encoder outputs (B, T_in, attention_dim)

  • attention_weights_cat (torch.Tensor) – cumulative and prev. att weights (B, 2, max_time)

Returns

alignment – (batch, max_time)

Return type

torch.Tensor

forward(attention_hidden_state, memory, processed_memory, attention_weights_cat, mask)[source]

Computes the forward pass

Parameters
  • attention_hidden_state (torch.Tensor) – attention rnn last output

  • memory (torch.Tensor) – encoder outputs

  • processed_memory (torch.Tensor) – processed encoder outputs

  • attention_weights_cat (torch.Tensor) – previous and cumulative attention weights

  • mask (torch.Tensor) – binary mask for padded data

Returns

result – a (attention_context, attention_weights) tuple

Return type

tuple

training: bool
class speechbrain.lobes.models.Tacotron2.Prenet(in_dim=80, sizes=[256, 256], dropout=0.5)[source]

Bases: Module

The Tacotron pre-net module consisting of a specified number of normalized (Xavier-initialized) linear layers

Parameters
  • in_dim (int) – the input dimension

  • sizes (list of int) – the dimensions of the hidden/output layers

  • dropout (float) – the dropout probability

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Prenet
>>> layer = Prenet()
>>> x = torch.randn(862, 2, 80)
>>> output = layer(x)
>>> output.shape
torch.Size([862, 2, 256])
forward(x)[source]

Computes the forward pass for the prenet

Parameters

x (torch.Tensor) – the prenet inputs

Returns

output – the output

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.Postnet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5)[source]

Bases: Module

The Tacotron postnet consists of a number of 1-d convolutional layers with Xavier initialization, tanh activations, and batch normalization. Depending on the configuration, the postnet may either refine the MEL spectrogram or upsample it to a linear spectrogram

Parameters
  • n_mel_channels (int) – the number of MEL spectrogram channels

  • postnet_embedding_dim (int) – the postnet embedding dimension

  • postnet_kernel_size (int) – the kernel size of the convolutions within the decoders

  • postnet_n_convolutions (int) – the number of convolutions in the postnet

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Postnet
>>> layer = Postnet()
>>> x = torch.randn(2, 80, 861)
>>> output = layer(x)
>>> output.shape
torch.Size([2, 80, 861])
forward(x)[source]

Computes the forward pass of the postnet

Parameters

x (torch.Tensor) – the postnet input (usually a MEL spectrogram)

Returns

output – the postnet output (a refined MEL spectrogram or a linear spectrogram depending on how the model is configured)

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.Encoder(encoder_n_convolutions=3, encoder_embedding_dim=512, encoder_kernel_size=5)[source]

Bases: Module

The Tacotron2 encoder module, consisting of a sequence of 1-d convolution banks (3 by default) and a bidirectional LSTM

Parameters
  • encoder_n_convolutions (int) – the number of encoder convolutions

  • encoder_embedding_dim (int) – the dimension of the encoder embedding

  • encoder_kernel_size (int) – the kernel size of the 1-D convolutional layers within the encoder

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Encoder
>>> layer = Encoder()
>>> x = torch.randn(2, 512, 128)
>>> input_lengths = torch.tensor([128, 83])
>>> outputs = layer(x, input_lengths)
>>> outputs.shape
torch.Size([2, 128, 512])
forward(x, input_lengths)[source]

Computes the encoder forward pass

Parameters
  • x (torch.Tensor) – a batch of inputs (sequence embeddings)

  • input_lengths (torch.Tensor) – a tensor of input lengths

Returns

outputs – the encoder output

Return type

torch.Tensor

infer(x, input_lengths)[source]

Performs a forward step in the inference context

Parameters
  • x (torch.Tensor) – a batch of inputs (sequence embeddings)

  • input_lengths (torch.Tensor) – a tensor of input lengths

Returns

outputs – the encoder output

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.Decoder(n_mel_channels=80, n_frames_per_step=1, encoder_embedding_dim=512, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, attention_rnn_dim=1024, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, early_stopping=True)[source]

Bases: Module

The Tacotron decoder

Parameters
  • n_mel_channels (int) – the number of channels in the MEL spectrogram

  • n_frames_per_step (int) – the number of spectrogram frames for each time step of the decoder

  • encoder_embedding_dim (int) – the dimension of the encoder embedding

  • attention_location_n_filters (int) – the number of filters in location-based attention

  • attention_location_kernel_size (int) – the kernel size of location-based attention

  • attention_rnn_dim (int) – RNN dimension for the attention layer

  • decoder_rnn_dim (int) – the decoder RNN dimension

  • prenet_dim (int) – the dimension of the prenet (inner and output layers)

  • max_decoder_steps (int) – the maximum number of decoder steps for the longest utterance expected for the model

  • gate_threshold (float) – the fixed threshold to which the outputs of the decoders will be compared

  • p_attention_dropout (float) – dropout probability for attention layers

  • p_decoder_dropout (float) – dropout probability for decoder layers

  • early_stopping (bool) – whether to stop decoding during inference once the gate output exceeds gate_threshold

Example

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Decoder
>>> layer = Decoder()
>>> memory = torch.randn(2, 173, 512)
>>> decoder_inputs = torch.randn(2, 80, 173)
>>> memory_lengths = torch.tensor([173, 91])
>>> mel_outputs, gate_outputs, alignments = layer(
...     memory, decoder_inputs, memory_lengths)
>>> mel_outputs.shape, gate_outputs.shape, alignments.shape
(torch.Size([2, 80, 173]), torch.Size([2, 173]), torch.Size([2, 173, 173]))
get_go_frame(memory)[source]

Gets all zeros frames to use as first decoder input

Parameters

memory (torch.Tensor) – encoder outputs

Returns

decoder_input – all zeros frames

Return type

torch.Tensor

initialize_decoder_states(memory)[source]

Initializes attention rnn states, decoder rnn states, attention weights, attention cumulative weights, attention context, stores memory and stores processed memory

Parameters

memory (torch.Tensor) – Encoder outputs

Returns

result – A tuple of tensors (

attention_hidden, attention_cell, decoder_hidden, decoder_cell, attention_weights, attention_weights_cum, attention_context, processed_memory,

)

Return type

tuple

parse_decoder_inputs(decoder_inputs)[source]

Prepares decoder inputs, i.e. mel outputs

Parameters

decoder_inputs (torch.Tensor) – inputs used for teacher-forced training, i.e. mel-specs

Returns

decoder_inputs – processed decoder inputs

Return type

torch.Tensor

parse_decoder_outputs(mel_outputs, gate_outputs, alignments)[source]

Prepares decoder outputs for output

Parameters
  • mel_outputs (torch.Tensor) – MEL-scale spectrogram outputs

  • gate_outputs (torch.Tensor) – gate output energies

  • alignments (torch.Tensor) – the alignment tensor

Returns
  • mel_outputs (torch.Tensor) – MEL-scale spectrogram outputs

  • gate_outputs (torch.Tensor) – gate output energies

  • alignments (torch.Tensor) – the alignment tensor

decode(decoder_input, attention_hidden, attention_cell, decoder_hidden, decoder_cell, attention_weights, attention_weights_cum, attention_context, memory, processed_memory, mask)[source]

Decoder step using stored states, attention and memory

Parameters
  • decoder_input (torch.Tensor) – previous mel output

  • attention_hidden (torch.Tensor) – the hidden state of the attention module

  • attention_cell (torch.Tensor) – the attention cell state

  • decoder_hidden (torch.Tensor) – the decoder hidden state

  • decoder_cell (torch.Tensor) – the decoder cell state

  • attention_weights (torch.Tensor) – the attention weights

  • attention_weights_cum (torch.Tensor) – cumulative attention weights

  • attention_context (torch.Tensor) – the attention context tensor

  • memory (torch.Tensor) – the memory tensor

  • processed_memory (torch.Tensor) – the processed memory tensor

  • mask (torch.Tensor) – the mask for padded data

Returns

  • mel_output (torch.Tensor) – the MEL-scale outputs

  • gate_output (torch.Tensor) – gate output energies

  • attention_weights (torch.Tensor) – attention weights

forward(memory, decoder_inputs, memory_lengths)[source]

Decoder forward pass for training

Parameters
  • memory (torch.Tensor) – Encoder outputs

  • decoder_inputs (torch.Tensor) – Decoder inputs for teacher forcing, i.e. mel-specs

  • memory_lengths (torch.Tensor) – Encoder output lengths for attention masking.

Returns

  • mel_outputs (torch.Tensor) – mel outputs from the decoder

  • gate_outputs (torch.Tensor) – gate outputs from the decoder

  • alignments (torch.Tensor) – sequence of attention weights from the decoder

infer(memory, memory_lengths)[source]

Decoder inference

Parameters
  • memory (torch.Tensor) – Encoder outputs

  • memory_lengths (torch.Tensor) – Encoder output lengths (for attention masking)

Returns

  • mel_outputs (torch.Tensor) – mel outputs from the decoder

  • gate_outputs (torch.Tensor) – gate outputs from the decoder

  • alignments (torch.Tensor) – sequence of attention weights from the decoder

  • mel_lengths (torch.Tensor) – the length of MEL spectrograms

training: bool
class speechbrain.lobes.models.Tacotron2.Tacotron2(mask_padding=True, n_mel_channels=80, n_symbols=148, symbols_embedding_dim=512, encoder_kernel_size=5, encoder_n_convolutions=3, encoder_embedding_dim=512, attention_rnn_dim=1024, attention_dim=128, attention_location_n_filters=32, attention_location_kernel_size=31, n_frames_per_step=1, decoder_rnn_dim=1024, prenet_dim=256, max_decoder_steps=1000, gate_threshold=0.5, p_attention_dropout=0.1, p_decoder_dropout=0.1, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, decoder_no_early_stopping=False)[source]

Bases: Module

The Tacotron2 text-to-speech model, based on the NVIDIA implementation.

This class is the main entry point for the model; it instantiates all submodules, which in turn manage the individual neural network layers.

Simplified STRUCTURE: input -> word embedding -> encoder -> attention -> decoder (+ prenet) -> postnet -> output

The prenet takes the decoder output from the previous time step as input; its output, concatenated with the attention context, forms the decoder input.

Parameters
  • mask_padding (bool) – whether or not to mask the padded portion of the Tacotron outputs

  • n_mel_channels (int) – number of mel channels for constructing the spectrogram

  • n_symbols (int) – number of accepted character symbols defined in textToSequence

  • symbols_embedding_dim (int) – embedding dimension for the symbols fed to nn.Embedding

  • encoder_kernel_size (int) – size of the kernel processing the embeddings

  • encoder_n_convolutions (int) – number of convolution layers in the encoder

  • encoder_embedding_dim (int) – number of kernels in the encoder; this is also the dimension of the bidirectional LSTM in the encoder

  • attention_rnn_dim (int) – input dimension of the attention RNN

  • attention_dim (int) – number of hidden representations in the attention

  • attention_location_n_filters (int) – number of 1-D convolution filters in location-based attention

  • attention_location_kernel_size (int) – length of the 1-D convolution filters

  • n_frames_per_step (int) – number of mel frames generated per decoder step (currently only 1 is supported)

  • decoder_rnn_dim (int) – dimension of the two unidirectional stacked LSTM units in the decoder

  • prenet_dim (int) – dimension of the linear prenet layers

  • max_decoder_steps (int) – maximum number of steps/frames the decoder generates before stopping

  • gate_threshold (float) – cut-off level: any output probability above this value is considered complete and stops generation, so outputs have variable length

  • p_attention_dropout (float) – attention dropout probability

  • p_decoder_dropout (float) – decoder dropout probability

  • postnet_embedding_dim (int) – number of postnet filters

  • postnet_kernel_size (int) – 1-D size of the postnet kernels

  • postnet_n_convolutions (int) – number of convolution layers in the postnet

  • decoder_no_early_stopping (bool) – determines early stopping of the decoder together with gate_threshold; its logical inverse is fed to the decoder

Example

>>> import torch
>>> _ = torch.manual_seed(213312)
>>> from speechbrain.lobes.models.Tacotron2 import Tacotron2
>>> model = Tacotron2(
...    mask_padding=True,
...    n_mel_channels=80,
...    n_symbols=148,
...    symbols_embedding_dim=512,
...    encoder_kernel_size=5,
...    encoder_n_convolutions=3,
...    encoder_embedding_dim=512,
...    attention_rnn_dim=1024,
...    attention_dim=128,
...    attention_location_n_filters=32,
...    attention_location_kernel_size=31,
...    n_frames_per_step=1,
...    decoder_rnn_dim=1024,
...    prenet_dim=256,
...    max_decoder_steps=32,
...    gate_threshold=0.5,
...    p_attention_dropout=0.1,
...    p_decoder_dropout=0.1,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    decoder_no_early_stopping=False
... )
>>> _ = model.eval()
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> outputs, output_lengths, alignments = model.infer(inputs, input_lengths)
>>> outputs.shape, output_lengths.shape, alignments.shape
(torch.Size([2, 80, 1]), torch.Size([2]), torch.Size([2, 1, 5]))
parse_output(outputs, output_lengths, alignments_dim=None)[source]

Masks the padded part of output

Parameters
  • outputs (list) – a list of tensors - raw outputs

  • output_lengths (torch.Tensor) – a tensor representing the lengths of all outputs

  • alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training

Returns

result – a (mel_outputs, mel_outputs_postnet, gate_outputs, alignments) tuple with the original outputs - with the mask applied

Return type

tuple

forward(inputs, alignments_dim=None)[source]

Decoder forward pass for training

Parameters
  • inputs (tuple) – batch object

  • alignments_dim (int) – the desired dimension of the alignments along the last axis. Optional, but needed for data-parallel training

Returns

  • mel_outputs (torch.Tensor) – mel outputs from the decoder

  • mel_outputs_postnet (torch.Tensor) – mel outputs from postnet

  • gate_outputs (torch.Tensor) – gate outputs from the decoder

  • alignments (torch.Tensor) – sequence of attention weights from the decoder

  • output_lengths (torch.Tensor) – length of the output without padding
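Example

A minimal training-style sketch. The layout of the inputs tuple below, (text_padded, input_lengths, mel_padded, max_len, output_lengths), mirrors the NVIDIA-style batch and is an assumption here, as are all tensor sizes:

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Tacotron2
>>> model = Tacotron2()
>>> text_padded = torch.tensor([[13, 12, 31, 14, 19]])
>>> input_lengths = torch.tensor([5])
>>> mel_padded = torch.randn(1, 80, 173)  # assumed (batch, n_mel_channels, time)
>>> max_len, output_lengths = 5, torch.tensor([173])
>>> batch = (text_padded, input_lengths, mel_padded, max_len, output_lengths)
>>> outputs = model(batch)  # the tuple of outputs documented above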

infer(inputs, input_lengths)[source]

Produces outputs

Parameters
  • inputs (torch.Tensor) – text or phoneme sequences converted to tensors

  • input_lengths (torch.Tensor) – the lengths of the input sequences

Returns

  • mel_outputs_postnet (torch.Tensor) – final mel output of tacotron 2

  • mel_lengths (torch.Tensor) – length of mels

  • alignments (torch.Tensor) – sequence of attention weights

training: bool
speechbrain.lobes.models.Tacotron2.get_mask_from_lengths(lengths, max_len=None)[source]

Creates a mask from a tensor of lengths

Parameters
  • lengths (torch.Tensor) – a tensor of sequence lengths

  • max_len (int) – the maximum length, i.e. the last dimension of the mask tensor. If not provided, it will be calculated automatically

Returns

mask – the mask

Return type

torch.Tensor
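Example

A minimal usage sketch; only the mask shape is asserted here, since the exact dtype and polarity of the mask follow the implementation:

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import get_mask_from_lengths
>>> lengths = torch.tensor([2, 4])
>>> mask = get_mask_from_lengths(lengths)
>>> mask.shape
torch.Size([2, 4])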

speechbrain.lobes.models.Tacotron2.infer(model, text_sequences, input_lengths)[source]

An inference hook for pretrained synthesizers

Parameters
  • model (Tacotron2) – the Tacotron2 model

  • text_sequences (torch.Tensor) – encoded text sequences

  • input_lengths (torch.Tensor) – the lengths of the input sequences

Returns

result – (mel_outputs_postnet, mel_lengths, alignments) - the exact model output

Return type

tuple
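
Example

A minimal sketch of the hook, assuming a small model configured as in the Tacotron2 example above:

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import Tacotron2, infer
>>> model = Tacotron2(max_decoder_steps=32)
>>> _ = model.eval()
>>> text_sequences = torch.tensor([[13, 12, 31, 14, 19]])
>>> input_lengths = torch.tensor([5])
>>> with torch.no_grad():
...     mel_outputs_postnet, mel_lengths, alignments = infer(
...         model, text_sequences, input_lengths)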

speechbrain.lobes.models.Tacotron2.LossStats

alias of TacotronLoss

class speechbrain.lobes.models.Tacotron2.Loss(guided_attention_sigma=None, gate_loss_weight=1.0, guided_attention_weight=1.0, guided_attention_scheduler=None, guided_attention_hard_stop=None)[source]

Bases: Module

The Tacotron loss implementation

The loss consists of an MSE loss on the spectrogram, a BCE gate loss and a guided attention loss (if enabled) that attempts to make the attention matrix diagonal

The output of the module is a LossStats tuple, which includes both the total loss and its individual components

Parameters
  • guided_attention_sigma (float) – The guided attention sigma factor, controlling the “width” of the mask

  • gate_loss_weight (float) – The constant by which the gate loss will be multiplied

  • guided_attention_weight (float) – The weight for the guided attention

  • guided_attention_scheduler (callable) – The scheduler class for the guided attention loss

  • guided_attention_hard_stop (int) – The number of epochs after which guided attention will be completely turned off

Example

>>> import torch
>>> _ = torch.manual_seed(42)
>>> from speechbrain.lobes.models.Tacotron2 import Loss
>>> loss = Loss(guided_attention_sigma=0.2)
>>> mel_target = torch.randn(2, 80, 861)
>>> gate_target = torch.randn(1722, 1)
>>> mel_out = torch.randn(2, 80, 861)
>>> mel_out_postnet = torch.randn(2, 80, 861)
>>> gate_out = torch.randn(2, 861)
>>> alignments = torch.randn(2, 861, 173)
>>> targets = mel_target, gate_target
>>> model_outputs = mel_out, mel_out_postnet, gate_out, alignments
>>> input_lengths = torch.tensor([173, 91])
>>> target_lengths = torch.tensor([861, 438])
>>> loss(model_outputs, targets, input_lengths, target_lengths, 1)
TacotronLoss(loss=tensor(4.8566), mel_loss=tensor(4.0097), gate_loss=tensor(0.8460), attn_loss=tensor(0.0010), attn_weight=tensor(1.))

forward(model_output, targets, input_lengths, target_lengths, epoch)[source]

Computes the loss

Parameters
  • model_output (tuple) – the output of the model’s forward(): (mel_outputs, mel_outputs_postnet, gate_outputs, alignments)

  • targets (tuple) – the targets

  • input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths

  • target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths

  • epoch (int) – the current epoch number (used for the scheduling of the guided attention loss). A StepScheduler is typically used

Returns

result – the total loss, along with the individual losses (mel and gate)

Return type

LossStats

get_attention_loss(alignments, input_lengths, target_lengths, epoch)[source]

Computes the attention loss

Parameters
  • alignments (torch.Tensor) – the alignment matrix from the model

  • input_lengths (torch.Tensor) – a (batch, length) tensor of input lengths

  • target_lengths (torch.Tensor) – a (batch, length) tensor of target (spectrogram) lengths

  • epoch (int) – the current epoch number (used for the scheduling of the guided attention loss). A StepScheduler is typically used

Returns

attn_loss – the attention loss value

Return type

torch.Tensor

training: bool
class speechbrain.lobes.models.Tacotron2.TextMelCollate(n_frames_per_step=1)[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step

Parameters

n_frames_per_step (int) – the number of output frames per step

Returns

result – a tuple of tensors to be used as inputs/targets (

text_padded, input_lengths, mel_padded, gate_padded, output_lengths, len_x

)

Return type

tuple

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrograms

Parameters

batch (list) – [text_normalized, mel_normalized]
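Example

A hypothetical sketch assuming each batch item is a (text, mel) pair as documented above; real dataset items may carry additional fields:

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import TextMelCollate
>>> collate_fn = TextMelCollate(n_frames_per_step=1)
>>> batch = [
...     (torch.tensor([1, 2, 3]), torch.randn(80, 10)),
...     (torch.tensor([4, 5]), torch.randn(80, 7)),
... ]
>>> (text_padded, input_lengths, mel_padded,
...  gate_padded, output_lengths, len_x) = collate_fn(batch)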

speechbrain.lobes.models.Tacotron2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals

Parameters
  • x (torch.Tensor) – the input signal

  • C (int) – the compression factor

  • clip_val (float) – the minimum value to which the input is clamped before compression
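Example

A minimal sketch; the compression is assumed to be the usual log(clamp(x, min=clip_val) * C), which preserves the input shape:

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import dynamic_range_compression
>>> spectrogram = torch.rand(2, 80, 100) + 0.01
>>> compressed = dynamic_range_compression(spectrogram)
>>> compressed.shape
torch.Size([2, 80, 100])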

speechbrain.lobes.models.Tacotron2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates the MelSpectrogram for a raw audio signal

Parameters
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.Tensor) – the input audio signal
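
Example

A hedged usage sketch; the parameter values are illustrative rather than prescribed defaults, and the return value is assumed to be the mel spectrogram tensor:

>>> import torch
>>> from speechbrain.lobes.models.Tacotron2 import mel_spectogram
>>> audio = torch.randn(1, 22050)  # one second of noise at 22.05 kHz
>>> mel = mel_spectogram(
...     sample_rate=22050, hop_length=256, win_length=1024, n_fft=1024,
...     n_mels=80, f_min=0.0, f_max=8000.0, power=1, normalized=False,
...     norm="slaney", mel_scale="slaney", compression=True, audio=audio,
... )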