speechbrain.lobes.models.FastSpeech2 module
Neural network modules for the FastSpeech 2: Fast and High-Quality End-to-End Text to Speech synthesis model.
Authors
* Sathvik Udupa 2022
* Pradnya Kandarkar 2023
* Yingzhi Wang 2023
Summary
Classes:
AlignmentNetwork | Learns the alignment between the input text and the spectrogram with Gaussian Attention.
BinaryAlignmentLoss | Binary loss that forces soft alignments to match the hard alignments as explained in https://arxiv.org/pdf/2108.10447.pdf.
DurationPredictor | Duration predictor layer
EncoderPreNet | Embedding layer for tokens
FastSpeech2 | The FastSpeech2 text-to-speech model.
FastSpeech2WithAlignment | The FastSpeech2 text-to-speech model with internal alignment.
ForwardSumLoss | CTC alignment loss
Loss | Loss Computation
LossWithAlignment | Loss computation including internal aligner
PostNet | FastSpeech2 Conv Postnet
SPNPredictor | The silent phoneme predictor.
SSIMLoss | SSIM loss as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity
TextMelCollate | Zero-pads model inputs and targets based on number of frames per step
TextMelCollateWithAlignment | Zero-pads model inputs and targets based on number of frames per step
Functions:
average_over_durations | Average values over durations.
dynamic_range_compression | Dynamic range compression for audio signals
maximum_path_numpy | Monotonic alignment search algorithm, numpy works faster than the torch implementation.
mel_spectogram | Calculates MelSpectrogram for a raw audio signal
upsample | Upsample encoder output according to durations
Reference
- class speechbrain.lobes.models.FastSpeech2.EncoderPreNet(n_vocab, blank_id, out_channels=512)[source]
Bases:
Module
Embedding layer for tokens
- Parameters:
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import EncoderPreNet
>>> encoder_prenet_layer = EncoderPreNet(n_vocab=40, blank_id=0, out_channels=384)
>>> x = torch.randint(0, 40, (3, 5))
>>> y = encoder_prenet_layer(x)
>>> y.shape
torch.Size([3, 5, 384])
- forward(x)[source]
Computes the forward pass
- Parameters:
x (torch.Tensor) – a (batch, tokens) input tensor
- Returns:
output – the embedding layer output
- Return type:
torch.Tensor
- class speechbrain.lobes.models.FastSpeech2.PostNet(n_mel_channels=80, postnet_embedding_dim=512, postnet_kernel_size=5, postnet_n_convolutions=5, postnet_dropout=0.5)[source]
Bases:
Module
FastSpeech2 Conv Postnet
- Parameters:
n_mel_channels (int) – input feature dimension for convolution layers
postnet_embedding_dim (int) – output feature dimension for convolution layers
postnet_kernel_size (int) – postnet convolution kernel size
postnet_n_convolutions (int) – number of convolution layers
postnet_dropout (float) – dropout probability for postnet
- forward(x)[source]
Computes the forward pass
- Parameters:
x (torch.Tensor) – a (batch, time_steps, features) input tensor
- Returns:
output – the spectrogram predicted
- Return type:
torch.Tensor
- class speechbrain.lobes.models.FastSpeech2.DurationPredictor(in_channels, out_channels, kernel_size, dropout=0.0, n_units=1)[source]
Bases:
Module
Duration predictor layer
- Parameters:
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import DurationPredictor
>>> duration_predictor_layer = DurationPredictor(in_channels=384, out_channels=384, kernel_size=3)
>>> x = torch.randn(3, 400, 384)
>>> mask = torch.ones(3, 400, 384)
>>> y = duration_predictor_layer(x, mask)
>>> y.shape
torch.Size([3, 400, 1])
- forward(x, x_mask)[source]
Computes the forward pass
- Parameters:
x (torch.Tensor) – a (batch, time_steps, features) input tensor
x_mask (torch.Tensor) – mask of input tensor
- Returns:
output – the duration predictor outputs
- Return type:
torch.Tensor
- class speechbrain.lobes.models.FastSpeech2.SPNPredictor(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, padding_idx)[source]
Bases:
Module
This module is the silent phoneme predictor. It receives phoneme sequences without any silent phoneme tokens as input and predicts whether a silent phoneme should be inserted after each position. This avoids the overly fast pace that occurs at inference time when the input sequence contains no silent phoneme tokens.
- Parameters:
enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder
enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers
enc_d_model (int) – the number of expected features in the encoder
enc_ffn_dim (int) – the dimension of the feedforward network model
enc_k_dim (int) – the dimension of the key
enc_v_dim (int) – the dimension of the value
enc_dropout (float) – Dropout for the encoder
normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.
ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer
ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn
n_char (int) – the number of symbols for the token embedding
padding_idx (int) – the index for padding
- forward(tokens, last_phonemes)[source]
forward pass for the module
- Parameters:
tokens (torch.Tensor) – input tokens without silent phonemes
last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not
- Returns:
spn_decision – indicates if a silent phoneme should be inserted after a phoneme
- Return type:
torch.Tensor
- infer(tokens, last_phonemes)[source]
inference function
- Parameters:
tokens (torch.Tensor) – input tokens without silent phonemes
last_phonemes (torch.Tensor) – indicates if a phoneme at an index is the last phoneme of a word or not
- Returns:
spn_decision – indicates if a silent phoneme should be inserted after a phoneme
- Return type:
torch.Tensor
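As an illustration of how a per-position decision could be applied, the sketch below inserts a silent phoneme token after every flagged position. It is written in plain Python rather than torch, and the helper name `insert_silent_phonemes` and the token id `spn_id=99` are hypothetical; they are not part of this module.

```python
def insert_silent_phonemes(tokens, spn_decision, spn_id):
    """Return a new token list with `spn_id` inserted after every flagged position."""
    if len(tokens) != len(spn_decision):
        raise ValueError("tokens and spn_decision must have the same length")
    out = []
    for token, insert_after in zip(tokens, spn_decision):
        out.append(token)
        if insert_after:
            # Hypothetical silent phoneme id, chosen only for this example.
            out.append(spn_id)
    return out


tokens = [13, 12, 31, 14, 19]
spn_decision = [0, 1, 0, 0, 1]  # 1 = insert a silent phoneme after this token
with_spn = insert_silent_phonemes(tokens, spn_decision, spn_id=99)
```

Here the result contains two extra tokens, one after the second phoneme and one at the end of the sequence.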
- class speechbrain.lobes.models.FastSpeech2.FastSpeech2(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]
Bases:
Module
The FastSpeech2 text-to-speech model. This class is the main entry point for the model. It instantiates all submodules, which in turn manage the individual neural network layers.
Simplified STRUCTURE: input -> token embedding -> encoder -> duration/pitch/energy predictor -> duration upsampler -> decoder -> output
During training, teacher forcing is used (ground truth durations are used for upsampling).
- Parameters:
enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder
enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers
enc_d_model (int) – the number of expected features in the encoder
enc_ffn_dim (int) – the dimension of the feedforward network model
enc_k_dim (int) – the dimension of the key
enc_v_dim (int) – the dimension of the value
enc_dropout (float) – Dropout for the encoder
dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder
dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers
dec_d_model (int) – the number of expected features in the decoder
dec_ffn_dim (int) – the dimension of the feedforward network model
dec_k_dim (int) – the dimension of the key
dec_v_dim (int) – the dimension of the value
dec_dropout (float) – dropout for the decoder
normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.
ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer.
ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn
n_char (int) – the number of symbols for the token embedding
n_mels (int) – number of bins in mel spectrogram
postnet_embedding_dim (int) – output feature dimension for convolution layers
postnet_kernel_size (int) – postnet convolution kernel size
postnet_n_convolutions (int) – number of convolution layers
postnet_dropout (float) – dropout probability for postnet
padding_idx (int) – the index for padding
dur_pred_kernel_size (int) – the convolution kernel size in duration predictor
pitch_pred_kernel_size (int) – kernel size for pitch prediction.
energy_pred_kernel_size (int) – kernel size for energy prediction.
variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2
>>> model = FastSpeech2(
...     enc_num_layers=6,
...     enc_num_head=2,
...     enc_d_model=384,
...     enc_ffn_dim=1536,
...     enc_k_dim=384,
...     enc_v_dim=384,
...     enc_dropout=0.1,
...     dec_num_layers=6,
...     dec_num_head=2,
...     dec_d_model=384,
...     dec_ffn_dim=1536,
...     dec_k_dim=384,
...     dec_v_dim=384,
...     dec_dropout=0.1,
...     normalize_before=False,
...     ffn_type='1dcnn',
...     ffn_cnn_kernel_size_list=[9, 1],
...     n_char=40,
...     n_mels=80,
...     postnet_embedding_dim=512,
...     postnet_kernel_size=5,
...     postnet_n_convolutions=5,
...     postnet_dropout=0.5,
...     padding_idx=0,
...     dur_pred_kernel_size=3,
...     pitch_pred_kernel_size=3,
...     energy_pred_kernel_size=3,
...     variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> input_lengths = torch.tensor([5, 4])
>>> durations = torch.tensor([
...     [2, 4, 1, 5, 3],
...     [1, 2, 4, 3, 0],
... ])
>>> mel_post, postnet_output, predict_durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens = model(inputs, durations=durations)
>>> mel_post.shape, predict_durations.shape
(torch.Size([2, 15, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
- forward(tokens, durations=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
forward pass for training and inference
- Parameters:
tokens (torch.Tensor) – batch of input tokens
durations (torch.Tensor) – batch of durations for each token. If it is None, the model will infer on predicted durations
pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches
energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies
pace (float) – scaling factor for durations
pitch_rate (float) – scaling factor for pitches
energy_rate (float) – scaling factor for energies
- Returns:
mel_post (torch.Tensor) – mel outputs from the decoder
postnet_output (torch.Tensor) – mel outputs from the postnet
predict_durations (torch.Tensor) – predicted durations of each token
predict_pitch (torch.Tensor) – predicted pitches of each token
avg_pitch (torch.Tensor) – target pitches for each token if input pitch is not None; None if input pitch is None
predict_energy (torch.Tensor) – predicted energies of each token
avg_energy (torch.Tensor) – target energies for each token if input energy is not None; None if input energy is None
mel_length – predicted lengths of mel spectrograms
- speechbrain.lobes.models.FastSpeech2.average_over_durations(values, durs)[source]
Average values over durations.
- Parameters:
values (torch.Tensor) – shape: [B, 1, T_de]
durs (torch.Tensor) – shape: [B, T_en]
- Returns:
avg – shape: [B, 1, T_en]
- Return type:
torch.Tensor
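The idea of averaging frame-level values over token durations can be sketched for a single utterance in plain Python. This is an illustration only: the name `average_over_durations_sketch` is hypothetical, it works on lists instead of `[B, 1, T_de]` tensors, and it takes a plain mean over each token's frames. Whether the library treats zero-valued (for example unvoiced) frames specially is not stated here and is not modeled.

```python
def average_over_durations_sketch(values, durs):
    """Average per-frame values over each token's span of frames.

    values: per-frame floats (length T_de).
    durs:   per-token frame counts (length T_en), summing to at most T_de.
    """
    averages = []
    start = 0
    for dur in durs:
        span = values[start:start + dur]
        # A token with zero duration has no frames; report 0.0 for it.
        averages.append(sum(span) / len(span) if span else 0.0)
        start += dur
    return averages


frame_pitch = [100.0, 110.0, 200.0, 210.0, 220.0, 230.0, 150.0]
durations = [2, 4, 1]
token_pitch = average_over_durations_sketch(frame_pitch, durations)
```

With these inputs the seven frame-level values collapse to three token-level values, one per entry in `durations`.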
- speechbrain.lobes.models.FastSpeech2.upsample(feats, durs, pace=1.0, padding_value=0.0)[source]
upsample encoder output according to durations
- Parameters:
feats (torch.Tensor) – batch of encoder output features to upsample
durs (torch.Tensor) – durations to be used to upsample
pace (float) – scaling factor for durations
padding_value (float) – value used to pad the upsampled sequences
- Returns:
mel_post (torch.Tensor) – mel outputs from the decoder
predict_durations (torch.Tensor) – predicted durations for each token
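Duration-based upsampling can be pictured as repeating each token's feature vector once per frame and then padding the batch to a common length. The sketch below is a plain-Python illustration under stated assumptions: the name `upsample_sketch` is hypothetical, `pace` is assumed to multiply the durations, and the scaled duration is assumed to be truncated to an integer.

```python
def upsample_sketch(feats, durs, pace=1.0, padding_value=0.0):
    """Repeat each token's feature vector for its (scaled) duration, then pad.

    feats: one list of feature vectors per utterance.
    durs:  one list of per-token frame counts per utterance.
    Returns the padded frame sequences and their unpadded lengths.
    """
    upsampled = []
    for utt_feats, utt_durs in zip(feats, durs):
        frames = []
        for vec, dur in zip(utt_feats, utt_durs):
            # Assumption: the scaled duration is truncated to an integer.
            frames.extend(list(vec) for _ in range(int(pace * dur)))
        upsampled.append(frames)

    lengths = [len(frames) for frames in upsampled]
    max_len = max(lengths, default=0)
    dim = len(feats[0][0])
    padded = [
        frames + [[padding_value] * dim for _ in range(max_len - len(frames))]
        for frames in upsampled
    ]
    return padded, lengths


feats = [
    [[1.0, 1.5], [2.0, 2.5]],
    [[3.0, 3.5], [4.0, 4.5]],
]
durs = [[2, 1], [1, 0]]
padded, lengths = upsample_sketch(feats, durs)
```

A token with duration 0 (such as padding) contributes no frames, and shorter utterances are filled with `padding_value` vectors.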
- class speechbrain.lobes.models.FastSpeech2.TextMelCollate[source]
Bases:
object
Zero-pads model inputs and targets based on number of frames per step
- __call__(batch)[source]
Collates a training batch from normalized text and mel-spectrogram
- Parameters:
batch (list) – [text_normalized, mel_normalized]
- Returns:
text_padded (torch.Tensor)
dur_padded (torch.Tensor)
input_lengths (torch.Tensor)
mel_padded (torch.Tensor)
pitch_padded (torch.Tensor)
energy_padded (torch.Tensor)
output_lengths (torch.Tensor)
len_x (torch.Tensor)
labels (torch.Tensor)
wavs (torch.Tensor)
no_spn_seq_padded (torch.Tensor)
spn_labels_padded (torch.Tensor)
last_phonemes_padded (torch.Tensor)
- class speechbrain.lobes.models.FastSpeech2.Loss(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, spn_loss_weight=1.0, spn_loss_max_epochs=8)[source]
Bases:
Module
Loss Computation
- Parameters:
log_scale_durations (bool) – applies logarithm to target durations
ssim_loss_weight (float) – weight for ssim loss
duration_loss_weight (float) – weight for the duration loss
pitch_loss_weight (float) – weight for the pitch loss
energy_loss_weight (float) – weight for the energy loss
mel_loss_weight (float) – weight for the mel loss
postnet_mel_loss_weight (float) – weight for the postnet mel loss
spn_loss_weight (float) – weight for spn loss
spn_loss_max_epochs (int) – Max number of epochs
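The weights above suggest that the individual loss terms are combined as a weighted sum. The following sketch shows that combination in plain Python; the helper `combine_losses`, the dictionary keys and the numeric values are hypothetical and only illustrate the arithmetic, not the class's actual interface.

```python
def combine_losses(terms, weights):
    """Weighted sum over the loss terms named in `weights`."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"no loss value for: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


# Made-up per-term loss values and weights, for illustration only.
terms = {"mel": 0.5, "postnet_mel": 0.25, "duration": 1.0}
weights = {"mel": 1.0, "postnet_mel": 2.0, "duration": 0.5}
total_loss = combine_losses(terms, weights)
```

Setting a weight to 0.0 removes the corresponding term from the total without changing the rest of the computation.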
- speechbrain.lobes.models.FastSpeech2.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, min_max_energy_norm, norm, mel_scale, compression, audio)[source]
calculates MelSpectrogram for a raw audio signal
- Parameters:
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
min_max_energy_norm (bool) – Whether to normalize by min-max
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
compression (bool) – whether to do dynamic range compression
audio (torch.Tensor) – input audio signal
- Returns:
mel (torch.Tensor)
rmse (torch.Tensor)
- speechbrain.lobes.models.FastSpeech2.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]
Dynamic range compression for audio signals
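A common formulation of dynamic range compression in TTS front ends is a clipped logarithm, `log(max(x, clip_val) * C)`. The plain-Python sketch below assumes that formulation; it is not confirmed by this page, and the name `dynamic_range_compression_sketch` is hypothetical.

```python
import math


def dynamic_range_compression_sketch(x, C=1, clip_val=1e-5):
    """Element-wise log(max(x, clip_val) * C) over a list of magnitudes."""
    # Clipping keeps the logarithm finite for zero or tiny magnitudes.
    return [math.log(max(value, clip_val) * C) for value in x]


compressed = dynamic_range_compression_sketch([0.0, 1.0, math.e])
```

The first input is clipped to `clip_val` before the logarithm, so silence maps to a large negative but finite value.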
- class speechbrain.lobes.models.FastSpeech2.SSIMLoss[source]
Bases:
Module
SSIM loss as (1 - SSIM). SSIM is explained here: https://en.wikipedia.org/wiki/Structural_similarity
- sequence_mask(sequence_length, max_len=None)[source]
Create a sequence mask for filtering padding in a sequence tensor.
- Parameters:
sequence_length (torch.Tensor) – Sequence lengths.
max_len (int) – Maximum sequence length. Defaults to None.
- Returns:
mask
- Return type:
[B, T_max]
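A sequence mask marks which positions of each padded sequence hold real data. The plain-Python sketch below builds a `[B, T_max]` boolean mask from a list of lengths; the name `sequence_mask_sketch` is hypothetical and the library version operates on tensors.

```python
def sequence_mask_sketch(sequence_length, max_len=None):
    """Return a [B, T_max] nested list: True for valid positions, False for padding."""
    if max_len is None:
        max_len = max(sequence_length, default=0)
    return [[t < length for t in range(max_len)] for length in sequence_length]


mask = sequence_mask_sketch([3, 1, 2])
```

Passing an explicit `max_len` pads every row of the mask to that width, which is useful when the batch tensor is wider than the longest sequence.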
- sample_wise_min_max(x: Tensor, mask: Tensor)[source]
Min-Max normalize tensor through first dimension
- Parameters:
x (torch.Tensor) – input tensor [B, D1, D2]
mask (torch.Tensor) – input mask [B, D1, 1]
- Return type:
Normalized tensor
- forward(y_hat, y, length)[source]
- Parameters:
y_hat (torch.Tensor) – model prediction values [B, T, D].
y (torch.Tensor) – target values [B, T, D].
length (torch.Tensor) – length of each sample in a batch for masking.
- Returns:
loss
- Return type:
Average loss value in range [0, 1] masked by the length.
- class speechbrain.lobes.models.FastSpeech2.TextMelCollateWithAlignment[source]
Bases:
object
Zero-pads model inputs and targets based on number of frames per step.
result (tuple) – a tuple of tensors to be used as inputs/targets (text_padded, dur_padded, input_lengths, mel_padded, output_lengths, len_x, labels, wavs)
- __call__(batch)[source]
Collates a training batch from normalized text and mel-spectrogram
- Parameters:
batch (list) – [text_normalized, mel_normalized]
- Returns:
phoneme_padded (torch.Tensor)
input_lengths (torch.Tensor)
mel_padded (torch.Tensor)
pitch_padded (torch.Tensor)
energy_padded (torch.Tensor)
output_lengths (torch.Tensor)
labels (torch.Tensor)
wavs (torch.Tensor)
- speechbrain.lobes.models.FastSpeech2.maximum_path_numpy(value, mask)[source]
Monotonic alignment search algorithm, numpy works faster than the torch implementation.
- Parameters:
value (torch.Tensor) – input alignment values [b, t_x, t_y]
mask (torch.Tensor) – input alignment mask [b, t_x, t_y]
- Returns:
path
- Return type:
torch.Tensor
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import maximum_path_numpy
>>> alignment = torch.rand(2, 5, 100)
>>> mask = torch.ones(2, 5, 100)
>>> hard_alignments = maximum_path_numpy(alignment, mask)
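Monotonic alignment search can be described as a dynamic program: every frame is assigned to exactly one token, the token index never decreases and never skips, and the total score is maximized. The sketch below shows that idea for a single unbatched, unmasked example in plain Python. The function name is hypothetical and the code is not the library's implementation.

```python
def monotonic_alignment_search(value):
    """value[i][j]: score of assigning frame j to token i.

    Returns a 0/1 path of the same shape with exactly one 1 per frame.
    """
    t_x, t_y = len(value), len(value[0])
    if t_y < t_x:
        raise ValueError("need at least as many frames as tokens")
    neg_inf = float("-inf")

    # best[i][j]: best score of a monotonic path whose frame j sits on token i.
    best = [[neg_inf] * t_y for _ in range(t_x)]
    best[0][0] = value[0][0]
    for j in range(1, t_y):
        for i in range(min(j + 1, t_x)):
            stay = best[i][j - 1]
            advance = best[i - 1][j - 1] if i > 0 else neg_inf
            best[i][j] = max(stay, advance) + value[i][j]

    # Backtrack from the last token at the last frame.
    path = [[0] * t_y for _ in range(t_x)]
    i = t_x - 1
    for j in range(t_y - 1, -1, -1):
        path[i][j] = 1
        if j > 0 and i > 0 and best[i - 1][j - 1] > best[i][j - 1]:
            i -= 1
    return path


scores = [
    [0.9, 0.8, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.2, 0.1],
    [0.0, 0.1, 0.0, 0.7, 0.8],
]
path = monotonic_alignment_search(scores)
```

Summing each row of `path` gives the number of frames assigned to each token.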
- class speechbrain.lobes.models.FastSpeech2.AlignmentNetwork(in_query_channels=80, in_key_channels=512, attn_channels=80, temperature=0.0005)[source]
Bases:
Module
Learns the alignment between the input text and the spectrogram with Gaussian Attention.
query -> conv1d -> relu -> conv1d -> relu -> conv1d -> L2_dist -> softmax -> alignment
key   -> conv1d -> relu -> conv1d - - - - - - - - - - - -^
- Parameters:
in_query_channels (int) – Number of channels in the query network. Defaults to 80.
in_key_channels (int) – Number of channels in the key network. Defaults to 512.
attn_channels (int) – Number of inner channels in the attention layers. Defaults to 80.
temperature (float) – Temperature for the softmax. Defaults to 0.0005.
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import AlignmentNetwork
>>> aligner = AlignmentNetwork(
...     in_query_channels=80,
...     in_key_channels=512,
...     attn_channels=80,
...     temperature=0.0005,
... )
>>> phoneme_feats = torch.rand(2, 512, 20)
>>> mels = torch.rand(2, 80, 100)
>>> alignment_soft, alignment_logprob = aligner(mels, phoneme_feats, None, None)
>>> alignment_soft.shape, alignment_logprob.shape
(torch.Size([2, 1, 100, 20]), torch.Size([2, 1, 100, 20]))
- forward(queries, keys, mask, attn_prior)[source]
Forward pass of the aligner encoder.
- Parameters:
queries (torch.Tensor) – the query tensor [B, C, T_de]
keys (torch.Tensor) – the key tensor [B, C_emb, T_en]
mask (torch.Tensor) – the query mask [B, T_de]
attn_prior (torch.Tensor) – the prior attention tensor [B, 1, T_en, T_de]
- Returns:
attn (torch.Tensor) – soft attention [B, 1, T_de, T_en]
attn_logp (torch.Tensor) – log probabilities [B, 1, T_de, T_en]
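The L2_dist -> softmax stage of the aligner can be sketched in plain Python: each query frame is scored against every key by negative squared distance, and a softmax over the keys turns the scores into a soft alignment. This illustration skips the conv1d stacks, the mask and the attention prior, and it assumes the temperature multiplies the negative squared distance; the function name `soft_alignment` is hypothetical.

```python
import math


def soft_alignment(queries, keys, temperature=0.0005):
    """queries: T_de vectors, keys: T_en vectors of the same dimension.

    Returns a [T_de][T_en] nested list whose rows each sum to 1.
    """
    attn = []
    for q in queries:
        # Assumption: logits are the negative squared L2 distance times the temperature.
        logits = [
            -temperature * sum((qi - ki) ** 2 for qi, ki in zip(q, k))
            for k in keys
        ]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn


queries = [[0.0, 0.0], [10.0, 10.0]]
keys = [[0.0, 0.0], [10.0, 10.0]]
attn = soft_alignment(queries, keys, temperature=1.0)
```

A larger temperature value sharpens the distribution in this formulation, while a very small one makes every row close to uniform.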
- class speechbrain.lobes.models.FastSpeech2.FastSpeech2WithAlignment(enc_num_layers, enc_num_head, enc_d_model, enc_ffn_dim, enc_k_dim, enc_v_dim, enc_dropout, in_query_channels, in_key_channels, attn_channels, temperature, dec_num_layers, dec_num_head, dec_d_model, dec_ffn_dim, dec_k_dim, dec_v_dim, dec_dropout, normalize_before, ffn_type, ffn_cnn_kernel_size_list, n_char, n_mels, postnet_embedding_dim, postnet_kernel_size, postnet_n_convolutions, postnet_dropout, padding_idx, dur_pred_kernel_size, pitch_pred_kernel_size, energy_pred_kernel_size, variance_predictor_dropout)[source]
Bases:
Module
The FastSpeech2 text-to-speech model with internal alignment. This class is the main entry point for the model, which is responsible for instantiating all submodules, which, in turn, manage the individual neural network layers. Certain parts are adopted from the following implementation: https://github.com/coqui-ai/TTS/blob/dev/TTS/tts/models/forward_tts.py
Simplified STRUCTURE: input -> token embedding -> encoder -> aligner -> duration/pitch/energy -> upsampler -> decoder -> output
- Parameters:
enc_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in encoder
enc_num_head (int) – number of multi-head-attention (MHA) heads in encoder transformer layers
enc_d_model (int) – the number of expected features in the encoder
enc_ffn_dim (int) – the dimension of the feedforward network model
enc_k_dim (int) – the dimension of the key
enc_v_dim (int) – the dimension of the value
enc_dropout (float) – Dropout for the encoder
in_query_channels (int) – Number of channels in the query network.
in_key_channels (int) – Number of channels in the key network.
attn_channels (int) – Number of inner channels in the attention layers.
temperature (float) – Temperature for the softmax.
dec_num_layers (int) – number of transformer layers (TransformerEncoderLayer) in decoder
dec_num_head (int) – number of multi-head-attention (MHA) heads in decoder transformer layers
dec_d_model (int) – the number of expected features in the decoder
dec_ffn_dim (int) – the dimension of the feedforward network model
dec_k_dim (int) – the dimension of the key
dec_v_dim (int) – the dimension of the value
dec_dropout (float) – dropout for the decoder
normalize_before (bool) – whether normalization should be applied before or after MHA or FFN in Transformer layers.
ffn_type (str) – whether to use convolutional layers instead of feed forward network inside transformer layer.
ffn_cnn_kernel_size_list (list of int) – conv kernel size of 2 1d-convs if ffn_type is 1dcnn
n_char (int) – the number of symbols for the token embedding
n_mels (int) – number of bins in mel spectrogram
postnet_embedding_dim (int) – output feature dimension for convolution layers
postnet_kernel_size (int) – postnet convolution kernel size
postnet_n_convolutions (int) – number of convolution layers
postnet_dropout (float) – dropout probability for postnet
padding_idx (int) – the index for padding
dur_pred_kernel_size (int) – the convolution kernel size in duration predictor
pitch_pred_kernel_size (int) – kernel size for pitch prediction.
energy_pred_kernel_size (int) – kernel size for energy prediction.
variance_predictor_dropout (float) – dropout probability for variance predictor (duration/pitch/energy)
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import FastSpeech2WithAlignment
>>> model = FastSpeech2WithAlignment(
...     enc_num_layers=6,
...     enc_num_head=2,
...     enc_d_model=384,
...     enc_ffn_dim=1536,
...     enc_k_dim=384,
...     enc_v_dim=384,
...     enc_dropout=0.1,
...     in_query_channels=80,
...     in_key_channels=384,
...     attn_channels=80,
...     temperature=0.0005,
...     dec_num_layers=6,
...     dec_num_head=2,
...     dec_d_model=384,
...     dec_ffn_dim=1536,
...     dec_k_dim=384,
...     dec_v_dim=384,
...     dec_dropout=0.1,
...     normalize_before=False,
...     ffn_type='1dcnn',
...     ffn_cnn_kernel_size_list=[9, 1],
...     n_char=40,
...     n_mels=80,
...     postnet_embedding_dim=512,
...     postnet_kernel_size=5,
...     postnet_n_convolutions=5,
...     postnet_dropout=0.5,
...     padding_idx=0,
...     dur_pred_kernel_size=3,
...     pitch_pred_kernel_size=3,
...     energy_pred_kernel_size=3,
...     variance_predictor_dropout=0.5)
>>> inputs = torch.tensor([
...     [13, 12, 31, 14, 19],
...     [31, 16, 30, 31, 0],
... ])
>>> mels = torch.rand(2, 100, 80)
>>> mel_post, postnet_output, durations, predict_pitch, avg_pitch, predict_energy, avg_energy, mel_lens, alignment_durations, alignment_soft, alignment_logprob, alignment_mas = model(inputs, mels)
>>> mel_post.shape, durations.shape
(torch.Size([2, 100, 80]), torch.Size([2, 5]))
>>> predict_pitch.shape, predict_energy.shape
(torch.Size([2, 5, 1]), torch.Size([2, 5, 1]))
>>> alignment_soft.shape, alignment_mas.shape
(torch.Size([2, 100, 5]), torch.Size([2, 100, 5]))
- forward(tokens, mel_spectograms=None, pitch=None, energy=None, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
forward pass for training and inference
- Parameters:
tokens (torch.Tensor) – batch of input tokens
mel_spectograms (torch.Tensor) – batch of mel_spectograms (used only for training)
pitch (torch.Tensor) – batch of pitch for each frame. If it is None, the model will infer on predicted pitches
energy (torch.Tensor) – batch of energy for each frame. If it is None, the model will infer on predicted energies
pace (float) – scaling factor for durations
pitch_rate (float) – scaling factor for pitches
energy_rate (float) – scaling factor for energies
- Returns:
mel_post (torch.Tensor) – mel outputs from the decoder
postnet_output (torch.Tensor) – mel outputs from the postnet
predict_durations (torch.Tensor) – predicted durations of each token
predict_pitch (torch.Tensor) – predicted pitches of each token
avg_pitch (torch.Tensor) – target pitches for each token if input pitch is not None; None if input pitch is None
predict_energy (torch.Tensor) – predicted energies of each token
avg_energy (torch.Tensor) – target energies for each token if input energy is not None; None if input energy is None
mel_length – predicted lengths of mel spectrograms
alignment_durations – durations from the hard alignment map
alignment_soft (torch.Tensor) – soft alignment potentials
alignment_logprob (torch.Tensor) – log scale alignment potentials
alignment_mas (torch.Tensor) – hard alignment map
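Durations can be read off a hard alignment map by counting, for each token, the frames assigned to it. The sketch below shows this for one utterance in plain Python, assuming a `[frames, tokens]` layout like the `[2, 100, 5]` shape of `alignment_mas` in the example; the helper name is hypothetical.

```python
def durations_from_hard_alignment(alignment_mas):
    """alignment_mas[frame][token] is 1 if the frame is assigned to the token."""
    n_tokens = len(alignment_mas[0])
    return [sum(frame[token] for frame in alignment_mas) for token in range(n_tokens)]


hard = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
token_durations = durations_from_hard_alignment(hard)
```

Because every frame is assigned to exactly one token, the durations sum to the number of frames.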
- class speechbrain.lobes.models.FastSpeech2.LossWithAlignment(log_scale_durations, ssim_loss_weight, duration_loss_weight, pitch_loss_weight, energy_loss_weight, mel_loss_weight, postnet_mel_loss_weight, aligner_loss_weight, binary_alignment_loss_weight, binary_alignment_loss_warmup_epochs, binary_alignment_loss_max_epochs)[source]
Bases:
Module
Loss computation including internal aligner
- Parameters:
log_scale_durations (bool) – applies logarithm to target durations
ssim_loss_weight (float) – weight for the ssim loss
duration_loss_weight (float) – weight for the duration loss
pitch_loss_weight (float) – weight for the pitch loss
energy_loss_weight (float) – weight for the energy loss
mel_loss_weight (float) – weight for the mel loss
postnet_mel_loss_weight (float) – weight for the postnet mel loss
aligner_loss_weight (float) – weight for the alignment loss
binary_alignment_loss_weight (float) – weight for the binary alignment loss
binary_alignment_loss_warmup_epochs (int) – Number of epochs to gradually increase the impact of binary loss.
binary_alignment_loss_max_epochs (int) – From this epoch on the impact of binary loss is ignored.
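The two epoch parameters describe a schedule for the binary alignment loss: its impact ramps up during warmup and is dropped from the maximum epoch on. One plausible reading is a linear ramp capped at 1.0, sketched below; the function name, the linear shape and the exact boundary handling are assumptions rather than the library's confirmed behavior.

```python
def binary_alignment_loss_scale(epoch, warmup_epochs, max_epochs):
    """Scale factor applied to the binary alignment loss at a given epoch."""
    if warmup_epochs <= 0:
        raise ValueError("warmup_epochs must be positive")
    # Assumption: the loss is ignored from `max_epochs` on.
    if epoch >= max_epochs:
        return 0.0
    # Assumption: linear ramp over the warmup epochs, capped at 1.0.
    return min(epoch / warmup_epochs, 1.0)


schedule = [binary_alignment_loss_scale(e, warmup_epochs=4, max_epochs=10) for e in (1, 4, 8, 10)]
```

The scale would then multiply `binary_alignment_loss_weight` when the total loss is assembled.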
- class speechbrain.lobes.models.FastSpeech2.ForwardSumLoss(blank_logprob=-1)[source]
Bases:
Module
CTC alignment loss
- Parameters:
blank_logprob – pad value
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import ForwardSumLoss
>>> loss_func = ForwardSumLoss()
>>> attn_logprob = torch.rand(2, 1, 100, 5)
>>> key_lens = torch.tensor([5, 5])
>>> query_lens = torch.tensor([100, 100])
>>> loss = loss_func(attn_logprob, key_lens, query_lens)
- forward(attn_logprob, key_lens, query_lens)[source]
- Parameters:
attn_logprob (torch.Tensor) – log scale alignment potentials [B, 1, query_lens, key_lens]
key_lens (torch.Tensor) – phoneme lengths
query_lens (torch.Tensor) – mel lengths
- Returns:
total_loss
- Return type:
torch.Tensor
- class speechbrain.lobes.models.FastSpeech2.BinaryAlignmentLoss[source]
Bases:
Module
Binary loss that forces soft alignments to match the hard alignments as explained in https://arxiv.org/pdf/2108.10447.pdf.
Example
>>> import torch
>>> from speechbrain.lobes.models.FastSpeech2 import BinaryAlignmentLoss
>>> loss_func = BinaryAlignmentLoss()
>>> alignment_hard = torch.randint(0, 2, (2, 100, 5))
>>> alignment_soft = torch.rand(2, 100, 5)
>>> loss = loss_func(alignment_hard, alignment_soft)
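One way to make soft alignments match hard ones is to penalize the negative log of the soft probability at every position the hard alignment selects. The plain-Python sketch below assumes that formulation for a single utterance; it is an illustration, not the library's confirmed computation, and the function name and `eps` floor are hypothetical.

```python
import math


def binary_alignment_loss_sketch(alignment_hard, alignment_soft, eps=1e-12):
    """Mean negative log of the soft probabilities at positions where hard == 1."""
    log_sum = 0.0
    count = 0
    for hard_row, soft_row in zip(alignment_hard, alignment_soft):
        for hard, soft in zip(hard_row, soft_row):
            if hard == 1:
                # Floor the probability so the logarithm stays finite.
                log_sum += math.log(max(soft, eps))
                count += 1
    return -log_sum / count if count else 0.0


hard = [[1, 0], [0, 1]]
perfect = binary_alignment_loss_sketch(hard, [[1.0, 0.0], [0.0, 1.0]])
uniform = binary_alignment_loss_sketch(hard, [[0.5, 0.5], [0.5, 0.5]])
```

The loss is zero when the soft alignment puts all of its mass on the hard path and grows as the mass spreads to other positions.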