speechbrain.lobes.models.FastSpeech2 module

Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.

Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023

Summary

Classes:

AlignmentNetwork

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

BinaryAlignmentLoss

Binary loss that forces soft alignments to match the hard alignments as explained in https://arxiv.org/pdf/2108.10447.pdf.

DurationPredictor

Duration predictor layer.

EncoderPreNet

Embedding layer for tokens.

FastSpeech2

The FastSpeech2 text-to-speech model.

FastSpeech2WithAlignment

The FastSpeech2 text-to-speech model with internal alignment.

ForwardSumLoss

CTC alignment loss.

Loss

Loss computation.

LossWithAlignment

Loss computation including internal aligner.

PostNet

FastSpeech2 Conv Postnet.

SPNPredictor

This module is the silent phoneme predictor.

SSIMLoss

SSIM loss, computed as (1 - SSIM).

TextMelCollate

Zero-pads model inputs and targets based on the number of frames per step.

TextMelCollateWithAlignment

Zero-pads model inputs and targets based on the number of frames per step.

Functions:

average_over_durations

Average values over durations.

dynamic_range_compression

Dynamic range compression for audio signals

maximum_path_numpy

Monotonic alignment search algorithm; the NumPy implementation works faster than the torch one.

mel_spectogram

Calculates a mel spectrogram for a raw audio signal.

upsample

Upsamples encoder output according to durations.

Reference

class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]

Bases: Module

Embedding layer for tokens.

Parameters:
  • n_vocab (int) – size of the dictionary of embeddings

  • blank_id (int) – padding index

  • out_channels (int) – the size of each embedding vector

Example

>>> import torch
>>> from speechbrain.nnet.embedding import Embedding
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.rand(3, 5)
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
forward(x)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, tokens) input tensor

Returns:

output – the embedding layer output

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]

Bases: Module

FastSpeech2 Conv Postnet.

Parameters:
  • n_mel_channels (int) – input feature dimension for convolution layers

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet
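
Example

A minimal usage sketch (an illustrative addition, not from the original docstring); it assumes the postnet maps a (batch, time_steps, n_mel_channels) input to an output of the same shape:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import PostNet
>>> postnet = PostNet(n_mel_channels=80)
>>> x = torch.randn(2, 100, 80)
>>> y = postnet(x)
>>> y.shape
torch.Size([2, 100, 80])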

forward(x)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

Returns:

output – the spectrogram predicted

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]

Bases: Module

Duration predictor layer.

Parameters:
  • in_channels (int) – input feature dimension for convolution layers

  • out_channels (int) – output feature dimension for convolution layers

  • kernel_size (int) – duration predictor convolution kernel size

  • dropout (float) – dropout probability, 0 by default

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
forward(x, x_mask)[source]

Computes the forward pass.

Parameters:
  • x (torch.Tensor) – a (batch, time_steps, features) input tensor

  • x_mask (torch.Tensor) – mask of the input tensor

Returns:

output – the duration predictor outputs

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]

Bases: Module

This module is the silent phoneme predictor. It receives phoneme sequences without any silent phoneme token as input and predicts whether a silent phoneme should be inserted after a position. This avoids the issue of an overly fast pace at inference time caused by the absence of silent phoneme tokens in the input sequence.

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • padding_idx (int) – the index for padding
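
Example

A hypothetical construction sketch; the hyperparameter values below mirror the FastSpeech2 example further down and are illustrative assumptions, as is the last_phonemes pattern:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SPNPredictor
>>> spn_predictor = SPNPredictor(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    padding_idx=0)
>>> tokens = torch.tensor([[13, 12, 31, 14, 19]])
>>> last_phonemes = torch.tensor([[0, 0, 1, 0, 1]])
>>> spn_decision = spn_predictor(tokens, last_phonemes)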

forward(tokens, last_phonemes)[source]

Forward pass for the module.

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

infer(tokens, last_phonemes)[source]

Inference function.

Parameters:
  • tokens (torch.Tensor) – input tokens without silent phonemes

  • last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not

Returns:

spn_decision – indicates if a silent phoneme should be inserted after a phoneme

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers.

Simplified STRUCTURE: input -> token embedding -> encoder -> duration/pitch/energy predictor -> duration upsampler -> decoder -> output

During training, teacher forcing is used (ground truth durations are used for upsampling).

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – dropout for the encoder

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers

  • ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction

  • energy_pred_kernel_size (int) – kernel size for energy prediction

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Forward pass for training and inference.

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • durations (torch.Tensor) – batch of durations for each token. If it is None, the model will infer on predicted durations

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None if the input pitch is None

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None if the input energy is None

  • mel_length – predicted lengths of mel spectrograms

training: bool
speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]

Average values over durations.

Parameters:
  • values (torch.Tensor) – shape: [B, 1, T_de]

  • durs (torch.Tensor) – shape: [B, T_en]

Returns:

avg – shape: [B, 1, T_en]

Return type:

torch.Tensor
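
Example

A short sketch (an illustrative addition): each row of durations sums to the number of frames, and the output follows the documented [B, 1, T_en] shape:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import average_over_durations
>>> values = torch.rand(2, 1, 15)  # frame-level values such as pitch, [B, 1, T_de]
>>> durs = torch.tensor([[2, 4, 1, 5, 3], [1, 2, 4, 3, 5]])  # [B, T_en], each row sums to 15
>>> avg = average_over_durations(values, durs)
>>> avg.shape
torch.Size([2, 1, 5])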

speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]

Upsamples encoder output according to durations.

Parameters:
  • feats (torch.Tensor) – batch of encoder outputs to be upsampled

  • durs (torch.Tensor) – durations to be used to upsample

  • pace (float) – scaling factor for durations

  • padding_value (float) – value used for padding

Returns:

  • upsampled_feats (torch.Tensor) – the input features repeated according to the durations

  • mel_lens (torch.Tensor) – lengths of the upsampled sequences
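
Example

A minimal call sketch (an illustrative addition based on the signature above; the returned tuple is left unpacked here):

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import upsample
>>> feats = torch.rand(2, 5, 384)  # encoder outputs [B, T_en, C]
>>> durs = torch.tensor([[2, 4, 1, 5, 3], [1, 2, 4, 3, 0]])
>>> outputs = upsample(feats, durs, pace=1.0)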

class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step.

Result: tuple – a tuple of tensors to be used as inputs/targets:

(text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram.

Parameters:
  • batch (list) – [text_normalized, mel_normalized]

class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]

Bases: Module

Loss computation.

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for the ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats.

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – the current training epoch

Returns:

loss – the loss value

Return type:

torch.Tensor

training: bool
speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]

Calculates a mel spectrogram for a raw audio signal.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • min_max_energy_norm (bool) – Whether to apply min-max normalization to the energy.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band.

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.Tensor) – input audio signal
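
Example

A hypothetical call sketch; all parameter values below are illustrative assumptions, and the result is left unpacked because the docstring does not specify the return structure:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import mel_spectogram
>>> audio = torch.rand(1, 16000)
>>> result = mel_spectogram(
...    sample_rate=16000, hop_length=256, win_length=1024, n_fft=1024,
...    n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...    min_max_energy_norm=True, norm="slaney", mel_scale="slaney",
...    compression=True, audio=audio)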

speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals: log-scale compression with clamping.

Parameters:
  • x (torch.Tensor) – input signal (e.g., a spectrogram)

  • C (int) – compression factor

  • clip_val (float) – minimum value used for clamping, to avoid taking the log of zero
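
Example

A small shape-preserving sketch (an illustrative addition); the compression is applied element-wise:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import dynamic_range_compression
>>> spec = torch.rand(2, 80, 100)
>>> compressed = dynamic_range_compression(spec)
>>> compressed.shape
torch.Size([2, 80, 100])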

class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]

Bases: Module

SSIM loss, computed as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity
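
Example

A minimal usage sketch (an illustrative addition); it assumes mel-shaped [B, T, n_mels] predictions/targets and per-item lengths:

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import SSIMLoss
>>> loss_func = SSIMLoss()
>>> y_hat = torch.rand(2, 100, 80)
>>> y = torch.rand(2, 100, 80)
>>> length = torch.tensor([100, 80])
>>> loss = loss_func(y_hat, y, length)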

sequence_mask(sequence_length, max_len=None)[source]

Creates a sequence mask for filtering padding in a sequence tensor.

Parameters:
  • sequence_length (torch.Tensor) – Sequence lengths.

  • max_len (int) – Maximum sequence length. Defaults to None.

Returns:

mask – sequence mask of shape [B, T_max]

Return type:

torch.Tensor

sample_wise_min_max(x: Tensor, mask: Tensor)[source]

Min-max normalizes a tensor over the first dimension.

Parameters:
  • x (torch.Tensor) – input tensor [B, D1, D2]

  • mask (torch.Tensor) – input mask [B, D1, 1]

forward(y_hat, y, length)[source]
Parameters:
  • y_hat (torch.Tensor) – model prediction

  • y (torch.Tensor) – ground truth

  • length (torch.Tensor) – sequence lengths, used for masking

Returns:

loss – average loss value in range [0, 1], masked by the length

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment[source]

Bases: object

Zero-pads model inputs and targets based on the number of frames per step.

Result: tuple – a tuple of tensors to be used as inputs/targets:

(text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)

__call__(batch)[source]

Collates a training batch from normalized text and mel-spectrogram.

Parameters:
  • batch (list) – [text_normalized, mel_normalized]

speechbrain.lobes.models.FastSpeech2.maximum_path_numpy(value, mask)[source]

Monotonic alignment search algorithm; the NumPy implementation works faster than the torch one.

Parameters:
  • value (torch.Tensor) – input alignment values [b, t_x, t_y]

  • mask (torch.Tensor) – input alignment mask [b, t_x, t_y]

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import maximum_path_numpy
>>> alignment = torch.rand(2, 5, 100)
>>> mask = torch.ones(2, 5, 100)
>>> hard_alignments = maximum_path_numpy(alignment, mask)
class speechbrain.lobes.models.FastSpeech2.AlignmentNetwork(in_query_channels=80, in_key_channels=512, attn_channels=80, temperature=0.0005)[source]

Bases: Module

Learns the alignment between the input text and the spectrogram with Gaussian Attention.

query -> conv1d -> relu -> conv1d -> relu -> conv1d -> L2_dist -> softmax -> alignment
key   -> conv1d -> relu -> conv1d -----------------------^

Parameters:
  • in_query_channels (int) – Number of channels in the query network. Defaults to 80.

  • in_key_channels (int) – Number of channels in the key network. Defaults to 512.

  • attn_channels (int) – Number of inner channels in the attention layers. Defaults to 80.

  • temperature (float) – Temperature for the softmax. Defaults to 0.0005.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import AlignmentNetwork
>>> aligner = AlignmentNetwork(
...     in_query_channels=80,
...     in_key_channels=512,
...     attn_channels=80,
...     temperature=0.0005,
... )
>>> phoneme_feats = torch.rand(2, 512, 20)
>>> mels = torch.rand(2, 80, 100)
>>> alignment_soft, alignment_logprob = aligner(mels, phoneme_feats, None, None)
>>> alignment_soft.shape, alignment_logprob.shape
(torch.Size([2, 1, 100, 20]), torch.Size([2, 1, 100, 20]))
forward(queries, keys, mask, attn_prior)[source]

Forward pass of the aligner encoder.

Parameters:
  • queries (torch.Tensor) – the query tensor [B, C, T_de]

  • keys (torch.Tensor) – the key tensor [B, C_emb, T_en]

  • mask (torch.Tensor) – the query mask [B, T_de]

  • attn_prior (torch.Tensor) – the prior attention tensor [B, 1, T_en, T_de]

Returns:

  • attn (torch.tensor) – soft attention [B, 1, T_en, T_de]

  • attn_logp (torch.tensor) – log probabilities [B, 1, T_en , T_de]

training: bool
class speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, in_query_channels, in_key_channels, attn_channels, temperature, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]

Bases: Module

The FastSpeech2 text-to-speech model with internal alignment. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers. Certain parts are adapted from the following implementation: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/forward_tts.py

Simplified STRUCTURE: input -> token embedding -> encoder -> aligner -> duration/pitch/energy -> upsampler -> decoder -> output

Parameters:
  • enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder

  • enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers

  • enc_d_model (int) – the number of expected features in the encoder

  • enc_ffn_dim (int) – the dimension of the feedforward network model

  • enc_k_dim (int) – the dimension of the key

  • enc_v_dim (int) – the dimension of the value

  • enc_dropout (float) – Dropout for the encoder

  • normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer

  • ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • in_query_channels (int) – Number of channels in the query network.

  • in_key_channels (int) – Number of channels in the key network.

  • attn_channels (int) – Number of inner channels in the attention layers.

  • temperature (float) – Temperature for the softmax.

  • dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder

  • dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers

  • dec_d_model (int) – the number of expected features in the decoder

  • dec_ffn_dim (int) – the dimension of the feedforward network model

  • dec_k_dim (int) – the dimension of the key

  • dec_v_dim (int) – the dimension of the value

  • dec_dropout (float) – dropout for the decoder

  • normalize_before – whether normalization should be applied before or after MHA or FFN in Transformer layers.

  • ffn_type – whether to use convolutional layers instead of feed forward network inside transformer layer.

  • ffn_cnn_kernel_size_list – conv kernel size of 2 1d-convs if ffn_type is 1dcnn

  • n_char (int) – the number of symbols for the token embedding

  • n_mels (int) – number of bins in mel spectrogram

  • postnet_embedding_dim (int) – output feature dimension for convolution layers

  • postnet_kernel_size (int) – postnet convolution kernel size

  • postnet_n_convolutions (int) – number of convolution layers

  • postnet_dropout (float) – dropout probability for postnet

  • padding_idx (int) – the index for padding

  • dur_pred_kernel_size (int) – the convolution kernel size in duration predictor

  • pitch_pred_kernel_size (int) – kernel size for pitch prediction.

  • energy_pred_kernel_size (int) – kernel size for energy prediction.

  • variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2WithAlignment
>>> model = FastSpeech2WithAlignment(
...    enc_num_layers=6,
...    enc_num_head=2,
...    enc_d_model=384,
...    enc_ffn_dim=1536,
...    enc_k_dim=384,
...    enc_v_dim=384,
...    enc_dropout=0.1,
...    in_query_channels=80,
...    in_key_channels=384,
...    attn_channels=80,
...    temperature=0.0005,
...    dec_num_layers=6,
...    dec_num_head=2,
...    dec_d_model=384,
...    dec_ffn_dim=1536,
...    dec_k_dim=384,
...    dec_v_dim=384,
...    dec_dropout=0.1,
...    normalize_before=False,
...    ffn_type='1dcnn',
...    ffn_cnn_kernel_size_list=[9, 1],
...    n_char=40,
...    n_mels=80,
...    postnet_embedding_dim=512,
...    postnet_kernel_size=5,
...    postnet_n_convolutions=5,
...    postnet_dropout=0.5,
...    padding_idx=0,
...    dur_pred_kernel_size=3,
...    pitch_pred_kernel_size=3,
...    energy_pred_kernel_size=3,
...    variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> mels = torch.rand(2, 100, 80)
>>> mel_post, postnet_output, durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens, alignment_durations, alignment_soft, alignment_logprob, alignment_mas = model(inputs, mels)
>>> mel_post.shape, durations.shape
(torch.Size([2, 100, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
>>> alignment_soft.shape, alignment_mas.shape
(torch.Size([2, 100, 5]), torch.Size([2, 100, 5]))
forward(tokens, mel_spectograms=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Forward pass for training and inference.

Parameters:
  • tokens (torch.Tensor) – batch of input tokens

  • mel_spectograms (torch.Tensor) – batch of mel spectrograms (used only for training)

  • pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches

  • energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies

  • pace (float) – scaling factor for durations

  • pitch_rate (float) – scaling factor for pitches

  • energy_rate (float) – scaling factor for energies

Returns:

  • mel_post (torch.Tensor) – mel outputs from the decoder

  • postnet_output (torch.Tensor) – mel outputs from the postnet

  • predict_durations (torch.Tensor) – predicted durations of each token

  • predict_pitch (torch.Tensor) – predicted pitches of each token

  • avg_pitch (torch.Tensor) – target pitches for each token if the input pitch is not None; None if the input pitch is None

  • predict_energy (torch.Tensor) – predicted energies of each token

  • avg_energy (torch.Tensor) – target energies for each token if the input energy is not None; None if the input energy is None

  • mel_length – predicted lengths of mel spectrograms

  • alignment_durations – durations from the hard alignment map

  • alignment_soft (torch.Tensor) – soft alignment potentials

  • alignment_logprob (torch.Tensor) – log scale alignment potentials

  • alignment_mas (torch.Tensor) – hard alignment map

training: bool
class speechbrain.lobes.models.FastSpeech2.LossWithAlignment(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, aligner_loss_weight, binary_alignment_loss_weight, binary_alignment_loss_warmup_epochs, binary_alignment_loss_max_epochs)[source]

Bases: Module

Loss computation including internal aligner.

Parameters:
  • log_scale_durations (bool) – applies logarithm to target durations

  • ssim_loss_weight (float) – weight for the ssim loss

  • duration_loss_weight (float) – weight for the duration loss

  • pitch_loss_weight (float) – weight for the pitch loss

  • energy_loss_weight (float) – weight for the energy loss

  • mel_loss_weight (float) – weight for the mel loss

  • postnet_mel_loss_weight (float) – weight for the postnet mel loss

  • aligner_loss_weight (float) – weight for the alignment loss

  • binary_alignment_loss_weight (float) – weight for the binary alignment loss

  • binary_alignment_loss_warmup_epochs (int) – number of epochs to gradually increase the impact of binary loss

  • binary_alignment_loss_max_epochs (int) – from this epoch on, the impact of binary loss is ignored

forward(predictions, targets, current_epoch)[source]

Computes the value of the loss function and updates stats.

Parameters:
  • predictions (tuple) – model predictions

  • targets (tuple) – ground truth data

  • current_epoch (int) – used to determine the start/end of the binary alignment loss

Returns:

loss – the loss value

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.FastSpeech2.ForwardSumLoss(blank_logprob=-1)[source]

Bases: Module

CTC alignment loss.

Parameters:
  • blank_logprob (float) – log probability value used to pad the attention matrix for the blank token; defaults to -1

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import ForwardSumLoss
>>> loss_func = ForwardSumLoss()
>>> attn_logprob = torch.rand(2, 1, 100, 5)
>>> key_lens = torch.tensor([5, 5])
>>> query_lens = torch.tensor([100, 100])
>>> loss = loss_func(attn_logprob, key_lens, query_lens)
forward(attn_logprob, key_lens, query_lens)[source]
Parameters:
  • attn_logprob (torch.Tensor) – attention log probabilities [B, 1, T_de, T_en]

  • key_lens (torch.Tensor) – lengths of the keys (text tokens)

  • query_lens (torch.Tensor) – lengths of the queries (spectrogram frames)
training: bool
class speechbrain.lobes.models.FastSpeech2.BinaryAlignmentLoss[source]

Bases: Module

Binary loss that forces soft alignments to match the hard alignments, as explained in https://arxiv.org/pdf/2108.10447.pdf.

Example

>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import BinaryAlignmentLoss
>>> loss_func = BinaryAlignmentLoss()
>>> alignment_hard = torch.randint(0, 2, (2, 100, 5))
>>> alignment_soft = torch.rand(2, 100, 5)
>>> loss = loss_func(alignment_hard, alignment_soft)
forward(alignment_hard, alignment_soft)[source]
Parameters:
  • alignment_hard (torch.Tensor) – hard alignment map [B, mel_lens, phoneme_lens]

  • alignment_soft (torch.Tensor) – soft alignment potentials [B, mel_lens, phoneme_lens]

training: bool