speechbrain.lobes.models.FastSpeech2 module
Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.
Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023
Summary
Classes:
DurationPredictor – Duration predictor layer.
EncoderPreNet – Embedding layer for tokens.
FastSpeech2 – The FastSpeech2 text-to-speech model.
Loss – Loss computation.
PositionalEmbedding – Computation of the positional embeddings.
PostNet – FastSpeech2 convolutional postnet.
SPNPredictor – Silent phoneme predictor.
SSIMLoss – SSIM loss, computed as (1 - SSIM).
TextMelCollate – Zero-pads model inputs and targets based on the number of frames per step.
Functions:
average_over_durations – Averages values over durations.
dynamic_range_compression – Dynamic range compression for audio signals.
mel_spectogram – Calculates the mel spectrogram of a raw audio signal.
upsample – Upsamples the encoder output according to durations.
Reference
- class speechbrain.lobes.models.FastSpeech2.PositionalEmbedding(embed_dim)[source]
Bases:
Module
Computation of the positional embeddings.
- Parameters:
embed_dim (int) – dimensionality of the embeddings
- forward(seq_len, mask, dtype)[source]
Computes the forward pass.
- Parameters:
seq_len (int) – length of the sequence
mask (torch.tensor) – mask applied to the positional embeddings
dtype (str) – dtype of the embeddings
- Returns:
pos_emb – the tensor with positional embeddings
- Return type:
torch.Tensor
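The text above does not say which formulation PositionalEmbedding uses. As an illustration only, the sketch below assumes the common Transformer-style sinusoidal scheme (a sine half followed by a cosine half) and is written in plain Python rather than torch. The function name `sinusoidal_positions` is hypothetical.

```python
import math


def sinusoidal_positions(seq_len, embed_dim):
    """Hypothetical helper: seq_len x embed_dim table of sinusoidal embeddings."""
    # One inverse frequency per pair of channels (assumes embed_dim is even).
    inv_freq = [1.0 / (10000 ** (i / embed_dim)) for i in range(0, embed_dim, 2)]
    table = []
    for pos in range(seq_len):
        angles = [pos * f for f in inv_freq]
        # Sine channels first, then cosine channels.
        table.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return table


pos_emb = sinusoidal_positions(seq_len=4, embed_dim=6)
```

Position 0 maps to zeros in the sine half and ones in the cosine half, which makes the table easy to sanity-check.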
- class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]
Bases:
Module
Embedding layer for tokens.
- Parameters:
n_vocab (int) – size of the dictionary of embeddings
blank_id (int) – padding index
out_channels (int) – the size of each embedding vector
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.randint(0, 40, (3, 5))
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
- class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]
Bases:
Module
FastSpeech2 convolutional postnet.
- Parameters:
n_mel_channels (int) – input feature dimension for convolution layers
postnet_embedding_dim (int) – output feature dimension for convolution layers
postnet_kernel_size (int) – postnet convolution kernel size
postnet_n_convolutions (int) – number of convolution layers
postnet_dropout (float) – dropout probability for the postnet
- class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]
Bases:
Module
Duration predictor layer.
- Parameters:
in_channels (int) – input feature dimension for convolution layers
out_channels (int) – output feature dimension for convolution layers
kernel_size (int) – duration predictor convolution kernel size
dropout (float) – dropout probability, 0 by default
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
- class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]
Bases:
Module
This module implements the silent phoneme predictor. It receives phoneme sequences without any silent phoneme tokens as input and predicts whether a silent phoneme should be inserted after each position. This avoids an overly fast pace at inference time caused by the absence of silent phoneme tokens in the input sequence.
- Parameters:
enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder
enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers
enc_d_model (int) – the number of expected features in the encoder
enc_ffn_dim (int) – the dimension of the feedforward network model
enc_k_dim (int) – the dimension of the key
enc_v_dim (int) – the dimension of the value
enc_dropout (float) – Dropout for the encoder
normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.
ffn_type (str) – whether to use convolutional layers instead of a feed-forward network inside the transformer layer
ffn_cnn_kernel_size_list (list of int) – kernel sizes of the two 1D convolutions when ffn_type is 1dcnn
n_char (int) – the number of symbols for the token embedding
padding_idx (int) – the index for padding
- forward(tokens, last_phonemes)[source]
Forward pass for the module.
- Parameters:
tokens (torch.Tensor) – input tokens without silent phonemes
last_phonemes (torch.Tensor) – indicates whether the phoneme at an index is the last phoneme of a word
- Returns:
spn_decision – indicates if a silent phoneme should be inserted after a phoneme
- Return type:
torch.Tensor
- infer(tokens, last_phonemes)[source]
Inference function.
- Parameters:
tokens (torch.Tensor) – input tokens without silent phonemes
last_phonemes (torch.Tensor) – indicates whether the phoneme at an index is the last phoneme of a word
- Returns:
spn_decision – indicates if a silent phoneme should be inserted after a phoneme
- Return type:
torch.Tensor
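To make the role of spn_decision concrete, here is a minimal plain-Python sketch of how such a decision could be applied to a token sequence. The helper `insert_silent_phonemes` and the token id 39 are hypothetical and not part of the module; the library operates on batched tensors instead of lists.

```python
def insert_silent_phonemes(tokens, spn_decision, spn_token):
    """Hypothetical helper: insert spn_token after each flagged position."""
    output = []
    for token, insert_after in zip(tokens, spn_decision):
        output.append(token)
        if insert_after:
            # The predictor flagged this position, so a silent phoneme follows it.
            output.append(spn_token)
    return output


tokens = [13, 12, 31, 14, 19]
spn_decision = [False, True, False, False, True]
with_spn = insert_silent_phonemes(tokens, spn_decision, spn_token=39)
```

The resulting sequence is longer than the input by the number of flagged positions.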
- class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]
Bases:
Module
The FastSpeech2 text-to-speech model. This class is the main entry point for the model. It instantiates all submodules, which in turn manage the individual neural network layers.
Simplified structure: input -> token embedding -> encoder -> duration predictor -> duration upsampler -> decoder -> output
During training, teacher forcing is used (ground truth durations are used for upsampling).
- Parameters:
enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder
enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers
enc_d_model (int) – the number of expected features in the encoder
enc_ffn_dim (int) – the dimension of the feedforward network model
enc_k_dim (int) – the dimension of the key
enc_v_dim (int) – the dimension of the value
enc_dropout (float) – dropout for the encoder
dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder
dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers
dec_d_model (int) – the number of expected features in the decoder
dec_ffn_dim (int) – the dimension of the feedforward network model
dec_k_dim (int) – the dimension of the key
dec_v_dim (int) – the dimension of the value
dec_dropout (float) – dropout for the decoder
normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers
ffn_type (str) – whether to use convolutional layers instead of a feed-forward network inside the transformer layer
ffn_cnn_kernel_size_list (list of int) – kernel sizes of the two 1D convolutions when ffn_type is 1dcnn
n_char (int) – the number of symbols for the token embedding
n_mels (int) – number of bins in the mel spectrogram
postnet_embedding_dim (int) – output feature dimension for convolution layers
postnet_kernel_size (int) – postnet convolution kernel size
postnet_n_convolutions (int) – number of convolution layers
postnet_dropout (float) – dropout probability for the postnet
padding_idx (int) – the index for padding
dur_pred_kernel_size (int) – the convolution kernel size in the duration predictor
pitch_pred_kernel_size (int) – kernel size for pitch prediction
energy_pred_kernel_size (int) – kernel size for energy prediction
variance_predictor_dropout (float) – dropout probability for the variance predictors (duration/pitch/energy)
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...     enc_num_layers=6,
...     enc_num_head=2,
...     enc_d_model=384,
...     enc_ffn_dim=1536,
...     enc_k_dim=384,
...     enc_v_dim=384,
...     enc_dropout=0.1,
...     dec_num_layers=6,
...     dec_num_head=2,
...     dec_d_model=384,
...     dec_ffn_dim=1536,
...     dec_k_dim=384,
...     dec_v_dim=384,
...     dec_dropout=0.1,
...     normalize_before=False,
...     ffn_type='1dcnn',
...     ffn_cnn_kernel_size_list=[9, 1],
...     n_char=40,
...     n_mels=80,
...     postnet_embedding_dim=512,
...     postnet_kernel_size=5,
...     postnet_n_convolutions=5,
...     postnet_dropout=0.5,
...     padding_idx=0,
...     dur_pred_kernel_size=3,
...     pitch_pred_kernel_size=3,
...     energy_pred_kernel_size=3,
...     variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
- forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Forward pass for training and inference.
- Parameters:
tokens (torch.Tensor) – batch of input tokens
durations (torch.Tensor) – batch of durations for each token. If it is None, the model will infer on predicted durations
pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches
energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies
pace (float) – scaling factor for durations
pitch_rate (float) – scaling factor for pitches
energy_rate (float) – scaling factor for energies
- Returns:
mel_post (torch.Tensor) – mel outputs from the decoder
postnet_output (torch.Tensor) – mel outputs from the postnet
predict_durations (torch.Tensor) – predicted durations of each token
predict_pitch (torch.Tensor) – predicted pitches of each token
avg_pitch (torch.Tensor) – target pitches for each token if input pitch is not None; otherwise None
predict_energy (torch.Tensor) – predicted energies of each token
avg_energy (torch.Tensor) – target energies for each token if input energy is not None; otherwise None
mel_length – predicted lengths of mel spectrograms
- speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]
Average values over durations.
- Parameters:
values (torch.Tensor) – shape: [B, 1, T_de]
durs (torch.Tensor) – shape: [B, T_en]
- Returns:
avg – shape: [B, 1, T_en]
- Return type:
torch.Tensor
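The shapes above mean that T_de frame-level values are reduced to T_en token-level values. A minimal plain-Python sketch for a single sequence follows. It assumes a simple mean over each token's frames and 0.0 for tokens with zero duration; the library may treat special cases (for example zero-valued frames) differently, and the helper name `average_per_token` is hypothetical.

```python
def average_per_token(values, durs):
    """Hypothetical helper: values has T_de frame entries, durs has T_en durations."""
    averages = []
    start = 0
    for dur in durs:
        frames = values[start:start + dur]
        # A token with zero duration covers no frames; use 0.0 for it.
        averages.append(sum(frames) / len(frames) if frames else 0.0)
        start += dur
    return averages


pitch = [100.0, 110.0, 120.0, 200.0, 210.0, 90.0]  # T_de = 6 frames
durs = [3, 2, 0, 1]                                # T_en = 4 tokens
avg = average_per_token(pitch, durs)
```

The output has one entry per token, and the durations must sum to the number of frames.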
- speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]
Upsample encoder output according to durations.
- Parameters:
feats (torch.tensor) – batch of input tokens
durs (torch.tensor) – durations to be used to upsample
pace (float) – scaling factor for durations
padding_value (int) – padding index
- Returns:
padded upsampled features (torch.Tensor) – input features repeated according to the scaled durations and padded with padding_value
lengths – number of frames in each upsampled sequence
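Duration-based upsampling repeats each token's feature vector once per frame and then pads the batch to a common length. The sketch below shows the idea on nested Python lists. It assumes the scaled duration `pace * dur` is truncated to an integer, and the name `upsample_sketch` is hypothetical; it is not the library's tensor implementation.

```python
def upsample_sketch(feats, durs, pace=1.0, padding_value=0.0):
    """Hypothetical list-based sketch: feats[b][t] is a vector, durs[b][t] a frame count."""
    upsampled = []
    for seq_feats, seq_durs in zip(feats, durs):
        frames = []
        for vec, dur in zip(seq_feats, seq_durs):
            # Assumption: the scaled duration is truncated to an integer.
            frames.extend(list(vec) for _ in range(int(pace * dur)))
        upsampled.append(frames)
    lengths = [len(frames) for frames in upsampled]
    max_len = max(lengths)
    dim = len(feats[0][0])
    for frames in upsampled:
        # Pad shorter sequences with constant vectors up to the longest one.
        frames.extend([padding_value] * dim for _ in range(max_len - len(frames)))
    return upsampled, lengths


feats = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0], [4.0, 4.0]]]
durs = [[2, 1], [1, 0]]
frames, lengths = upsample_sketch(feats, durs)
```

A token with duration 0 contributes no frames, which is how padded tokens drop out of the upsampled sequence.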
- class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]
Bases:
object
Zero-pads model inputs and targets based on the number of frames per step.
- Returns:
result (tuple) – a tuple of tensors to be used as inputs/targets: (text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)
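The zero-padding step can be sketched for token sequences alone. The helper `zero_pad` below is hypothetical and handles a single field; the collate class itself pads several fields (text, durations, mel spectrograms) and returns the tuple listed above.

```python
def zero_pad(sequences, pad_value=0):
    """Hypothetical helper: right-pad each sequence to the longest length in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


text_padded, input_lengths = zero_pad([[13, 12, 31, 14, 19], [31, 16, 30, 31]])
```

The original lengths are returned alongside the padded batch so that later stages can mask out the padding.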
- class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]
Bases:
Module
Loss computation.
- Parameters:
log_scale_durations (bool) – applies logarithm to target durations
duration_loss_weight (int) – weight for the duration loss
pitch_loss_weight (int) – weight for the pitch loss
energy_loss_weight (int) – weight for the energy loss
mel_loss_weight (int) – weight for the mel loss
postnet_mel_loss_weight (int) – weight for the postnet mel loss
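The sketch below illustrates one way per-term weights can combine into a single training loss. It assumes the total is a plain weighted sum and that log scaling means log(1 + d); neither detail is stated above, and both helper names are hypothetical.

```python
import math


def total_loss(losses, weights):
    """Hypothetical helper: weighted sum over the loss terms named in `weights`."""
    return sum(weights[name] * losses[name] for name in weights)


def log_scale(durations):
    # Assumption: log(1 + d), so zero-duration (padding) tokens map to 0.
    return [math.log1p(d) for d in durations]


losses = {"duration": 0.5, "pitch": 0.25, "energy": 0.25, "mel": 1.0, "postnet_mel": 1.0}
weights = {"duration": 1.0, "pitch": 1.0, "energy": 1.0, "mel": 1.0, "postnet_mel": 1.0}
total = total_loss(losses, weights)
log_durs = log_scale([0, 1, 4])
```

Setting a weight to 0 removes the corresponding term from the total without changing the rest of the computation.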
- speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]
Calculates the mel spectrogram of a raw audio signal.
- Parameters:
sample_rate (int) – sample rate of the audio signal
hop_length (int) – length of hop between STFT windows
win_length (int) – window size
n_fft (int) – size of FFT
n_mels (int) – number of mel filterbanks
f_min (float) – minimum frequency
f_max (float) – maximum frequency
power (float) – exponent for the magnitude spectrogram
normalized (bool) – whether to normalize by magnitude after STFT
norm (str or None) – if “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – scale to use: “htk” or “slaney”
compression (bool) – whether to do dynamic range compression
audio (torch.tensor) – input audio signal
- speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]
Dynamic range compression for audio signals
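A common formulation of dynamic range compression for spectrograms is log(max(x, clip_val) * C), which matches the parameter names in the signature. The list-based sketch below assumes that formulation; it is an illustration, not the library's tensor implementation, and the name `dynamic_range_compression_sketch` is hypothetical.

```python
import math


def dynamic_range_compression_sketch(x, C=1, clip_val=1e-5):
    """Hypothetical list-based version: log(max(v, clip_val) * C) for each value."""
    # Clipping keeps the logarithm finite for zero or near-zero magnitudes.
    return [math.log(max(v, clip_val) * C) for v in x]


compressed = dynamic_range_compression_sketch([0.0, 1.0, math.e])
```

Values below clip_val all map to the same floor, log(clip_val * C).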
- class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]
Bases:
Module
SSIM loss, computed as (1 - SSIM). SSIM is explained at https://en.wikipedia.org/wiki/Structural_similarity
- sequence_mask(sequence_length, max_len=None)[source]
Create a sequence mask for filtering padding in a sequence tensor.
- Parameters:
sequence_length (torch.Tensor) – sequence lengths
max_len (int) – maximum sequence length. Defaults to None.
- Returns:
mask – mask of shape [B, T_max]
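A sequence mask marks which positions of each padded sequence are real. A minimal plain-Python sketch, assuming True means "valid position" and that max_len defaults to the longest length in the batch (the name `sequence_mask_sketch` is hypothetical):

```python
def sequence_mask_sketch(sequence_length, max_len=None):
    """Hypothetical list-based version: returns a B x T_max grid of booleans."""
    if max_len is None:
        max_len = max(sequence_length)
    # Position t is valid for a sequence when t is smaller than its length.
    return [[t < length for t in range(max_len)] for length in sequence_length]


mask = sequence_mask_sketch([5, 4])
```

Each row contains as many True entries as the corresponding sequence length.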
- sample_wise_min_max(x: Tensor, mask: Tensor)[source]
Min-max normalize a tensor along the first dimension.
- Parameters:
x (torch.Tensor) – input tensor [B, D1, D2]
mask (torch.Tensor) – input mask [B, D1, 1]
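The sketch below shows min-max normalization for one sample of the batch, where the minimum and maximum are taken over unmasked rows only. Both that reading of the mask and the small `eps` in the denominator are assumptions, and the helper name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(sample, valid_rows, eps=1e-8):
    """Hypothetical helper: sample is a D1 x D2 list, valid_rows holds D1 booleans."""
    # Collect values from unmasked rows only when computing the range.
    valid = [v for row, keep in zip(sample, valid_rows) if keep for v in row]
    lo, hi = min(valid), max(valid)
    # eps guards against division by zero when all valid values are equal.
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in sample]


sample = [[0.0, 2.0], [4.0, 8.0], [100.0, 100.0]]
valid_rows = [True, True, False]  # the last row is padding
normalized = min_max_normalize(sample, valid_rows)
```

Valid entries land in [0, 1]; entries in masked rows are not meaningful and are expected to be ignored downstream.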
- forward(y_hat, y, length)[source]
- Parameters:
y_hat (torch.Tensor) – model prediction values [B, T, D].
y (torch.Tensor) – target values [B, T, D].
length (torch.Tensor) – length of each sample in a batch for masking.
- Returns:
loss – average loss value in range [0, 1], masked by the length