speechbrain.lobes.models.FastSpeech2 module

Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.

Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023

Summary

Classes:

DurationPredictor

Duration predictor layer.

EncoderPreNet

Embedding layer for tokens.

FastSpeech2

The FastSpeech2 text-to-speech model.

Loss

Loss computation.

PositionalEmbedding

Computation of the positional embeddings.

PostNet

FastSpeech2 Conv Postnet.

SPNPredictor

The silent phoneme predictor module.

SSIMLoss

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity

TextMelCollate

Zero-pads model inputs and targets based on the number of frames per step.

Functions:

average_over_durations

Average values over durations.

dynamic_range_compression

Dynamic range compression for audio signals

mel_spectogram

Calculates a MelSpectrogram for a raw audio signal.

upsample

Upsamples the encoder output according to durations.

Reference

class speechbrain.lobes.models.FastSpeech2.PositionalEmbedding(embed_dim)[source]

Bases: Module

Computation of the positional embeddings.

Parameters:
  • embed_dim (int) – dimensionality of the embeddings
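
Example

A minimal usage sketch; the mask shape (batch, seq_len, 1) and passing a torch dtype for the dtype argument are assumptions based on the forward documentation, not confirmed behaviour:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import PositionalEmbedding
>>> pos_emb_layer = PositionalEmbedding(embed_dim=384)
>>> mask = torch.ones(3, 5, 1)  # assumed mask, broadcastable over the embeddings
>>> pos_emb = pos_emb_layer(seq_len=5, mask=mask, dtype=torch.float32)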

forward(seq_len, mask, dtype)[source]

Computes the forward pass.

Parameters:
  • seq_len (int) – length of the sequence

  • mask (torch.Tensor) – mask applied to the positional embeddings

  • dtype (str) – dtype of the embeddings

Returns:

pos_emb – the tensor with positional embeddings

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]

Bases: Module

Embedding layer for tokens.

Parameters:
  • n_vocab (int) – size of the dictionary of embeddings

  • blank_id (int) – padding index

  • out_channels (int) – the size of each embedding vector

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.rand(3, 5)
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
forward(x)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, tokens) input tensor

Returns:

output – the embedding layer output

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]

Bases: Module

FastSpeech2 Conv Postnet.

Parameters:
  • n_mel_channels (int) – input feature dimension for convolution layers

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for the postnet
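
Example

A minimal usage sketch, assuming the (batch, time_steps, features) input layout from the forward documentation; the output is not asserted here:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import PostNet
>>> postnet = PostNet(n_mel_channels=80)
>>> x = torch.randn(2, 100, 80)  # batch of decoder mel outputs
>>> y = postnet(x)  # refined spectrogram prediction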

forward(x)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

Returns:

output – the predicted spectrogram

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]

Bases: Module

Duration predictor layer.

Parameters:
  • in_channels (int) – input feature dimension for convolution layers

  • out_channels (int) – output feature dimension for convolution layers

  • kernel_size (int) – duration predictor convolution kernel size

  • dropout (float) – dropout probability, 0 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
forward(x, x_mask)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

  • x_mask (torch.Tensor) – mask of the input tensor

Returns:

output – the duration predictor outputs

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]

Bases: Module

This module is the silent phoneme predictor. It receives phoneme sequences without any silent phoneme token as input and predicts whether a silent phoneme should be inserted after a position. This avoids the issue of a fast pace at inference time caused by having no silent phoneme tokens in the input sequence.

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers instead of a feed forward network inside the transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • padding_idx (int) – the index for padding
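
Example

A minimal construction and inference sketch; the hyperparameter values mirror the FastSpeech2 example below and the last_phonemes flags are illustrative:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SPNPredictor
>>> spn_predictor = SPNPredictor(
...     enc_num_layers=4,
...     enc_num_head=2,
...     enc_d_model=384,
...     enc_ffn_dim=1536,
...     enc_k_dim=384,
...     enc_v_dim=384,
...     enc_dropout=0.1,
...     normalize_before=False,
...     ffn_type='1dcnn',
...     ffn_cnn_kernel_size_list=[9, 1],
...     n_char=40,
...     padding_idx=0)
>>> tokens = torch.tensor([[13, 12, 31, 14, 19]])
>>> last_phonemes = torch.tensor([[0, 0, 1, 0, 1]])
>>> spn_decision = spn_predictor.infer(tokens, last_phonemes)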

forward(tokens, last_phonemes)[source]

Computes the forward pass for the module.

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates whether the phoneme at each index is the last phoneme of a word

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

infer(tokens, last_phonemes)[source]

Inference function.

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates whether the phoneme at each index is the last phoneme of a word

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified structure: input -> token embedding -> encoder -> duration predictor -> duration upsampler -> decoder -> output

During training, teacher forcing is used (ground truth durations are used for upsampling).

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in the encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – dropout for the encoder

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in the decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in the encoder and decoder transformer layers

  • ffn_type (str) – whether to use convolutional layers instead of a feed forward network inside the transformer layers

  • ffn_cnn_kernel_size_list (list of int) – kernel sizes of the two 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in the mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for the postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in the duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction

  • energy_pred_kernel_size (int) – kernel size for energy prediction

  • variance_predictor_dropout (float) – dropout probability for the variance predictors (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes the forward pass for training and inference.

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • durations (torch.Tensor) – batch of durations for each token. If None, the model infers on predicted durations

  • pitch (torch.Tensor) – batch of pitch values for each frame. If None, the model infers on predicted pitches

  • energy (torch.Tensor) – batch of energy values for each frame. If None, the model infers on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if input pitch is not None; None otherwise

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if input energy is not None; None otherwise

  • mel_lens (torch.Tensor) – predicted lengths of the mel spectrograms
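
For pure inference, the duration, pitch, and energy targets can simply be omitted, as in this hedged sketch reusing model and inputs from the class example above:

>>> outputs = model(inputs, pace=1.2)  # durations, pitch, and energy are predicted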

training: bool
speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]

Average values over durations.

Parameters:
  • values (torch.Tensor) – shape: [B, 1, T_de]

  • durs (torch.Tensor) – shape: [B, T_en]

Returns:

avg – shape: [B, 1, T_en]

Return type:

torch.Tensor
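
Example

A minimal usage sketch; the shapes follow the documented [B, 1, T_de] and [B, T_en] layouts, and the durations are assumed to sum to at most T_de per batch entry:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import average_over_durations
>>> values = torch.rand(2, 1, 15)  # e.g. frame-level pitch values
>>> durs = torch.tensor([[2, 4, 1, 5, 3], [1, 2, 4, 3, 0]])
>>> avg = average_over_durations(values, durs)
>>> avg.shape
torch.Size([2, 1, 5])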

speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]

Upsamples the encoder output according to durations.

Parameters:
  • feats (torch.Tensor) – batch of encoder outputs to be upsampled

  • durs (torch.Tensor) – durations to be used for upsampling

  • pace (float) – scaling factor for durations

  • padding_value (float) – value used for padding

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • predict_durations (torch.Tensor) – predicted durations for each token
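
Example

A minimal usage sketch; the feature dimension is illustrative and the structure of the return value is not asserted here:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import upsample
>>> feats = torch.rand(2, 5, 384)  # encoder outputs, one vector per token
>>> durs = torch.tensor([[2, 4, 1, 5, 3], [1, 2, 4, 3, 0]])
>>> upsampled = upsample(feats, durs, pace=1.0)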

class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step.

Returns a tuple of tensors to be used as inputs/targets:

(text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrograms.

Parameters:
  • batch (list) – [text_normalized, mel_normalized]

class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]

Bases: Module

Loss computation.

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • duration_loss_weight (int) – weight for the duration loss

  • pitch_loss_weight (int) – weight for the pitch loss

  • energy_loss_weight (int) – weight for the energy loss

  • mel_loss_weight (int) – weight for the mel loss

  • postnet_mel_loss_weight (int) – weight for the postnet mel loss
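
Example

A minimal construction sketch; the weight values are illustrative, not recommended settings:

>>> from speechbrain.lobes.models.FastSpeech2 import Loss
>>> loss_fn = Loss(
...     log_scale_durations=True,
...     ssim_loss_weight=1.0,
...     duration_loss_weight=1.0,
...     pitch_loss_weight=1.0,
...     energy_loss_weight=1.0,
...     mel_loss_weight=1.0,
...     postnet_mel_loss_weight=1.0)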

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats.

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

Returns:

loss – the loss value

Return type:

torch.Tensor

training: bool
speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]

Calculates a MelSpectrogram for a raw audio signal.

Parameters:
  • sample_rate (int) – sample rate of the audio signal

  • hop_length (int) – length of hop between STFT windows

  • win_length (int) – window size

  • n_fft (int) – size of the FFT

  • n_mels (int) – number of mel filterbanks

  • f_min (float) – minimum frequency

  • f_max (float) – maximum frequency

  • power (float) – exponent for the magnitude spectrogram

  • normalized (bool) – whether to normalize by magnitude after the STFT

  • norm (str or None) – if “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – scale to use: “htk” or “slaney”

  • compression (bool) – whether to apply dynamic range compression

  • audio (torch.Tensor) – input audio signal
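
Example

A minimal usage sketch; the parameter values are illustrative and the structure of the return value is not asserted here:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import mel_spectogram
>>> audio = torch.rand(1, 16000)  # one second of audio at 16 kHz
>>> result = mel_spectogram(
...     sample_rate=16000,
...     hop_length=256,
...     win_length=1024,
...     n_fft=1024,
...     n_mels=80,
...     f_min=0.0,
...     f_max=8000.0,
...     power=1,
...     normalized=False,
...     min_max_energy_norm=True,
...     norm="slaney",
...     mel_scale="slaney",
...     compression=True,
...     audio=audio)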

speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals
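
This kind of compression is commonly implemented as log(clamp(x, min=clip_val) * C); whether this function follows that exact formula is an assumption based on the parameter names. A minimal usage sketch:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import dynamic_range_compression
>>> spec = torch.rand(1, 80, 100)  # e.g. a magnitude mel spectrogram
>>> compressed = dynamic_range_compression(spec)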

class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]

Bases: Module

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity
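
Example

A minimal usage sketch, assuming (batch, time_steps, features) spectrogram layouts and per-item frame lengths; these shapes are assumptions, not documented behaviour:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_fn = SSIMLoss()
>>> y_hat = torch.rand(2, 100, 80)  # predicted mel spectrograms
>>> y = torch.rand(2, 100, 80)      # target mel spectrograms
>>> length = torch.tensor([100, 80])
>>> loss = loss_fn(y_hat, y, length)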

sequence_mask(sequence_length, max_len=None)[source]

Creates a sequence mask for filtering padding in a sequence tensor.

Parameters:
  • sequence_length (torch.Tensor) – sequence lengths

  • max_len (int) – maximum sequence length. Defaults to None

Returns:

mask – the sequence mask of shape [B, T_max]

Return type:

torch.Tensor

sample_wise_min_max(x: Tensor, mask: Tensor)[source]

Min-max normalizes a tensor over the first dimension.

Parameters:
  • x (torch.Tensor) – input tensor [B, D1, D2]

  • mask (torch.Tensor) – input mask [B, D1, 1]

forward(y_hat, y, length)[source]

Computes the SSIM loss.

Parameters:
  • y_hat (torch.Tensor) – model prediction

  • y (torch.Tensor) – ground truth

  • length (torch.Tensor) – relevant sequence lengths

Returns:

loss – average loss value in range [0, 1], masked by the length

Return type:

torch.Tensor

training: bool