speechbrain.lobes.models.FastSpeech2 module

Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.

Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023

Summary

Classes:

AlignmentNetwork

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

BinaryAlignmentLoss

Binary loss that forces soft alignments to match the hard alignments as explained in https://arxiv.org/pdf/2108.10447.pdf.

DurationPredictor

Duration predictor layer.

EncoderPreNet

Embedding layer for tokens.

FastSpeech2

The FastSpeech2 text-to-speech model.

FastSpeech2WithAlignment

The FastSpeech2 text-to-speech model with internal alignment.

ForwardSumLoss

CTC alignment loss.

Loss

Loss computation.

LossWithAlignment

Loss computation including internal aligner.

PostNet

FastSpeech2 Conv Postnet.

SPNPredictor

This module is the silent phoneme predictor.

SSIMLoss

SSIM loss, computed as (1 - SSIM).

TextMelCollate

Zero-pads model inputs and targets based on the number of frames per step.

TextMelCollateWithAlignment

Zero-pads model inputs and targets based on the number of frames per step.

Functions:

average_over_durations

Average values over durations.

dynamic_range_compression

Dynamic range compression for audio signals

maximum_path_numpy

Monotonic alignment search algorithm; the NumPy implementation works faster than the torch one.

mel_spectogram

Calculates a mel spectrogram for a raw audio signal.

upsample

Upsamples encoder output according to durations.

Reference

class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]

Bases: Module

Embedding layer for tokens.

Parameters:
  • n_vocab (int) – size of the dictionary of embeddings

  • blank_id (int) – padding index

  • out_channels (int) – the size of each embedding vector

Example

>>> import torch
>>> from speechbrain.nnet.embedding import Embedding
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.rand(3, 5)
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
forward(x)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, tokens) input tensor

Returns:

output – the embedding layer output

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]

Bases: Module

FastSpeech2 Conv Postnet.

Parameters:
  • n_mel_channels (int) – input feature dimension for convolution layers

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet
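
Example

A minimal usage sketch (an illustrative addition, not from the original docstring); it assumes the postnet maps a (batch, time_steps, n_mel_channels) input to an output of the same shape:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import PostNet
>>> postnet = PostNet(n_mel_channels=80)
>>> x = torch.randn(2, 100, 80)
>>> y = postnet(x)
>>> y.shape
torch.Size([2, 100, 80])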

forward(x)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

Returns:

output – the spectrogram predicted

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]

Bases: Module

Duration predictor layer.

Parameters:
  • in_channels (int) – input feature dimension for convolution layers

  • out_channels (int) – output feature dimension for convolution layers

  • kernel_size (int) – duration predictor convolution kernel size

  • dropout (float) – dropout probability, 0 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
forward(x, x_mask)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

  • x_mask (torch.Tensor) – mask of the input tensor

Returns:

output – the duration predictor outputs

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]

Bases: Module

This module is the silent phoneme predictor. It receives phoneme sequences without any silent phoneme token as input and predicts whether a silent phoneme should be inserted after a position. This avoids the issue of an overly fast pace at inference time caused by the absence of silent phoneme tokens in the input sequence.

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • padding_idx (int) – the index for padding
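
Example

A hypothetical construction sketch; the hyperparameter values below mirror the FastSpeech2 example further down and are illustrative assumptions, as is the last_phonemes pattern:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SPNPredictor
>>> spn_predictor = SPNPredictor(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    padding_idx=0)
>>> tokens = torch.tensor([[13, 12, 31, 14, 19]])
>>> last_phonemes = torch.tensor([[0, 0, 1, 0, 1]])
>>> spn_decision = spn_predictor(tokens, last_phonemes)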

forward(tokens, last_phonemes)[source]

Forward pass for the module.

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

infer(tokens, last_phonemes)[source]

Inference function.

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified STRUCTURE: input -> token embedding -> encoder -> duration/pitch/energy predictor -> duration upsampler -> decoder -> output

During training, teacher forcing is used (ground truth durations are used for upsampling).

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – dropout for the encoder

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers

  • ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction

  • energy_pred_kernel_size (int) – kernel size for energy prediction

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Forward pass for training and inference.

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • durations (torch.Tensor) – batch of durations for each token. If it is None, the model will infer on predicted durations

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None if the input pitch is None

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None if the input energy is None

  • mel_length – predicted lengths of mel spectrograms

training: bool
speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]

Average values over durations.

Parameters:
  • values (torch.Tensor) – shape: [B, 1, T_de]

  • durs (torch.Tensor) – shape: [B, T_en]

Returns:

avg – shape: [B, 1, T_en]

Return type:

torch.Tensor
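
Example

A short sketch (an illustrative addition): each row of durations sums to the number of frames, and the output follows the documented [B, 1, T_en] shape:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import average_over_durations
>>> values = torch.rand(2, 1, 15)  # frame-level values such as pitch, [B, 1, T_de]
>>> durs = torch.tensor([[2, 4, 1, 5, 3], [1, 2, 4, 3, 5]])  # [B, T_en], each row sums to 15
>>> avg = average_over_durations(values, durs)
>>> avg.shape
torch.Size([2, 1, 5])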

speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]

Upsamples encoder output according to durations.

Parameters:
  • feats (torch.Tensor) – batch of encoder outputs to be upsampled

  • durs (torch.Tensor) – durations to be used to upsample

  • pace (float) – scaling factor for durations

  • padding_value (float) – value used for padding

Returns:

  • upsampled_feats (torch.Tensor) – the input features repeated according to the durations

  • mel_lens (torch.Tensor) – lengths of the upsampled sequences
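
Example

A minimal call sketch (an illustrative addition based on the signature above; the returned tuple is left unpacked here):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import upsample
>>> feats = torch.rand(2, 5, 384)  # encoder outputs [B, T_en, C]
>>> durs = torch.tensor([[2, 4, 1, 5, 3], [1, 2, 4, 3, 0]])
>>> outputs = upsample(feats, durs, pace=1.0)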

class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step.

Result: tuple – a tuple of tensors to be used as inputs/targets:

(text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram.

Parameters:
  • batch (list) – [text_normalized, mel_normalized]

class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]

Bases: Module

Loss computation.

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for the ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats.

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – the current training epoch

Returns:

loss – the loss value

Return type:

torch.Tensor

training: bool
speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]

Calculates a mel spectrogram for a raw audio signal.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • min_max_energy_norm (bool) – Whether to apply min-max normalization to the energy.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band.

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.Tensor) – input audio signal
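
Example

A hypothetical call sketch; all parameter values below are illustrative assumptions, and the result is left unpacked because the docstring does not specify the return structure:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import mel_spectogram
>>> audio = torch.rand(1, 16000)
>>> result = mel_spectogram(
...    sample_rate=16000, hop_length=256, win_length=1024, n_fft=1024,
...    n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...    min_max_energy_norm=True, norm="slaney", mel_scale="slaney",
...    compression=True, audio=audio)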

speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals: log-scale compression with clamping.

Parameters:
  • x (torch.Tensor) – input signal (e.g., a spectrogram)

  • C (int) – compression factor

  • clip_val (float) – minimum value used for clamping, to avoid taking the log of zero
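
Example

A small shape-preserving sketch (an illustrative addition); the compression is applied element-wise:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import dynamic_range_compression
>>> spec = torch.rand(2, 80, 100)
>>> compressed = dynamic_range_compression(spec)
>>> compressed.shape
torch.Size([2, 80, 100])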

class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]

Bases: Module

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity
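
Example

A minimal usage sketch (an illustrative addition); it assumes mel-shaped [B, T, n_mels] predictions/targets and per-item lengths:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_func = SSIMLoss()
>>> y_hat = torch.rand(2, 100, 80)
>>> y = torch.rand(2, 100, 80)
>>> length = torch.tensor([100, 80])
>>> loss = loss_func(y_hat, y, length)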

sequence_mask(sequence_length, max_len=None)[source]

Creates a sequence mask for filtering padding in a sequence tensor.

Parameters:
  • sequence_length (torch.Tensor) – Sequence lengths.

  • max_len (int) – Maximum sequence length. Defaults to None.

Returns:

mask – sequence mask of shape [B, T_max]

Return type:

torch.Tensor

sample_wise_min_max(x: Tensor, mask: Tensor)[source]

Min-max normalizes a tensor over the first dimension.

Parameters:
  • x (torch.Tensor) – input tensor [B, D1, D2]

  • mask (torch.Tensor) – input mask [B, D1, 1]

forward(y_hat, y, length)[source]
Parameters:
  • y_hat (torch.Tensor) – model prediction

  • y (torch.Tensor) – ground truth

  • length (torch.Tensor) – sequence lengths, used for masking

Returns:

loss – average loss value in range [0, 1], masked by the length

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step.

Result: tuple – a tuple of tensors to be used as inputs/targets:

(text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram.

Parameters:
  • batch (list) – [text_normalized, mel_normalized]

speechbrain.lobes.models.FastSpeech2.maximum_path_numpy(value, mask)[source]

Monotonic alignment search algorithm; the NumPy implementation works faster than the torch one.

Parameters:
  • value (torch.Tensor) – input alignment values [b, t_x, t_y]

  • mask (torch.Tensor) – input alignment mask [b, t_x, t_y]

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import maximum_path_numpy
>>> alignment = torch.rand(2, 5, 100)
>>> mask = torch.ones(2, 5, 100)
>>> hard_alignments = maximum_path_numpy(alignment, mask)
class speechbrain.lobes.models.FastSpeech2.AlignmentNetwork(in_query_channels=80, in_key_channels=512, attn_channels=80, temperature=0.0005)[source]

Bases: Module

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

query -> conv1d -> relu -> conv1d -> relu -> conv1d -> L2_dist -> softmax -> alignment
key   -> conv1d -> relu -> conv1d -----------------------^

Parameters:
  • in_query_channels (int) – Number of channels in the query network. Defaults to 80.

  • in_key_channels (int) – Number of channels in the key network. Defaults to 512.

  • attn_channels (int) – Number of inner channels in the attention layers. Defaults to 80.

  • temperature (float) – Temperature for the softmax. Defaults to 0.0005.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import AlignmentNetwork
>>> aligner = AlignmentNetwork(
...     in_query_channels=80,
...     in_key_channels=512,
...     attn_channels=80,
...     temperature=0.0005,
... )
>>> phoneme_feats = torch.rand(2, 512, 20)
>>> mels = torch.rand(2, 80, 100)
>>> alignment_soft, alignment_logprob = aligner(mels, phoneme_feats, None, None)
>>> alignment_soft.shape, alignment_logprob.shape
(torch.Size([2, 1, 100, 20]), torch.Size([2, 1, 100, 20]))
forward(queries, keys, mask, attn_prior)[source]

Forward pass of the aligner encoder.

Parameters:
  • queries (torch.Tensor) – the query tensor [B, C, T_de]

  • keys (torch.Tensor) – the key tensor [B, C_emb, T_en]

  • mask (torch.Tensor) – the query mask [B, T_de]

  • attn_prior (torch.Tensor) – the prior attention tensor [B, 1, T_en, T_de]

Returns:

  • attn (torch.tensor) – soft attention [B, 1, T_en, T_de]

  • attn_logp (torch.tensor) – log probabilities [B, 1, T_en , T_de]

training: bool
class speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, in_query_channels, in_key_channels, attn_channels, temperature, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model with internal alignment. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers. Certain parts are adapted from the following implementation: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/forward_tts.py

Simplified STRUCTURE: input -> token embedding -> encoder -> aligner -> duration/pitch/energy -> upsampler -> decoder -> output

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • in_query_channels (int) – Number of channels in the query network.

  • in_key_channels (int) – Number of channels in the key network.

  • attn_channels (int) – Number of inner channels in the attention layers.

  • temperature (float) – Temperature for the softmax.

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type – whether to use convolutional layers instead of feed forward network inside transformer layer.

  • ffn_cnn_kernel_size_list – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction.

  • energy_pred_kernel_size (int) – kernel size for energy prediction.

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2WithAlignment
>>> model = FastSpeech2WithAlignment(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    in_query_channels=80,
...    in_key_channels=384,
...    attn_channels=80,
...    temperature=0.0005,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> mels = torch.rand(2, 100, 80)
>>> mel_post, postnet_output, durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens, alignment_durations, alignment_soft, alignment_logprob, alignment_mas = model(inputs, mels)
>>> mel_post.shape, durations.shape
(torch.Size([2, 100, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
>>> alignment_soft.shape, alignment_mas.shape
(torch.Size([2, 100, 5]), torch.Size([2, 100, 5]))
forward(tokens, mel_spectograms=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Forward pass for training and inference.

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • mel_spectograms (torch.Tensor) – batch of mel spectrograms (used only for training)

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None if the input pitch is None

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None if the input energy is None

  • mel_length – predicted lengths of mel spectrograms

  • alignment_durations – durations from the hard alignment map

  • alignment_soft (torch.Tensor) – soft alignment potentials

  • alignment_logprob (torch.Tensor) – log scale alignment potentials

  • alignment_mas (torch.Tensor) – hard alignment map

training: bool
class speechbrain.lobes.models.FastSpeech2.LossWithAlignment(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, aligner_loss_weight, binary_alignment_loss_weight, binary_alignment_loss_warmup_epochs, binary_alignment_loss_max_epochs)[source]

Bases: Module

Loss computation including internal aligner.

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for the ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

  • aligner_loss_weight (float) – weight for the alignment loss

  • binary_alignment_loss_weight (float) – weight for the binary alignment loss

  • binary_alignment_loss_warmup_epochs (int) – number of epochs to gradually increase the impact of binary loss

  • binary_alignment_loss_max_epochs (int) – from this epoch on, the impact of binary loss is ignored

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats.

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – used to determine the start/end of the binary alignment loss

Returns:

loss – the loss value

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.ForwardSumLoss(blank_logprob=-1)[source]

Bases: Module

CTC alignment loss.

Parameters:
  • blank_logprob (float) – log probability value used to pad the attention matrix for the blank token; defaults to -1

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import ForwardSumLoss
>>> loss_func = ForwardSumLoss()
>>> attn_logprob = torch.rand(2, 1, 100, 5)
>>> key_lens = torch.tensor([5, 5])
>>> query_lens = torch.tensor([100, 100])
>>> loss = loss_func(attn_logprob, key_lens, query_lens)
forward(attn_logprob, key_lens, query_lens)[source]
Parameters:
  • attn_logprob (torch.Tensor) – attention log probabilities [B, 1, T_de, T_en]

  • key_lens (torch.Tensor) – lengths of the keys (text tokens)

  • query_lens (torch.Tensor) – lengths of the queries (spectrogram frames)
training: bool
class speechbrain.lobes.models.FastSpeech2.BinaryAlignmentLoss[source]

Bases: Module

Binary loss that forces soft alignments to match the hard alignments, as explained in https://arxiv.org/pdf/2108.10447.pdf.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import BinaryAlignmentLoss
>>> loss_func = BinaryAlignmentLoss()
>>> alignment_hard = torch.randint(0, 2, (2, 100, 5))
>>> alignment_soft = torch.rand(2, 100, 5)
>>> loss = loss_func(alignment_hard, alignment_soft)
forward(alignment_hard, alignment_soft)[source]
Parameters:
  • alignment_hard (torch.Tensor) – hard alignment map [B, mel_lens, phoneme_lens]

  • alignment_soft (torch.Tensor) – soft alignment potentials [B, mel_lens, phoneme_lens]

training: bool