speechbrain.lobes.models.HifiGAN module
Neural network modules for HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.
For more details: https://arxiv.org/pdf/2010.05646.pdf, https://arxiv.org/abs/2406.10735
- Authors
Jarod Duret 2021
Yingzhi WANG 2022
Summary
Classes:
DiscriminatorLoss | Creates a summary of discriminator losses.
DiscriminatorP | HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs -> score, feat.
DiscriminatorS | HiFiGAN Scale Discriminator.
GeneratorLoss | Creates a summary of generator losses and applies weights for different losses.
HifiganDiscriminator | HiFiGAN discriminator wrapping MPD and MSD.
HifiganGenerator | HiFiGAN Generator with Multi-Receptive Field Fusion (MRF).
HingeDLoss | Hinge Discriminator Loss.
HingeGLoss | Hinge Generator Loss.
L1SpecLoss | L1 Loss over Spectrograms as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: L1 loss helps learning details compared with L2 loss.
MSEDLoss | Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.
MSEGLoss | Mean Squared Generator Loss. The generator is trained to fool the discriminator, so that generated samples are classified as close to 1 as possible.
MelganFeatureLoss | Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
MultiPeriodDiscriminator | HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the PeriodDiscriminator to apply it in different periods.
MultiScaleDiscriminator | HiFiGAN Multi-Scale Discriminator.
MultiScaleSTFTLoss | Multi-scale STFT loss.
ResBlock1 | Residual Block Type 1, which has 3 convolutional layers in each convolution block.
ResBlock2 | Residual Block Type 2, which has 2 convolutional layers in each convolution block.
STFTLoss | STFT loss.
UnitHifiganGenerator | The UnitHiFiGAN generator takes discrete speech tokens as input.
VariancePredictor | Variance predictor inspired by FastSpeech2.
Functions:
dynamic_range_compression | Dynamic range compression for audio signals.
mel_spectogram | Calculates the mel spectrogram of a raw audio signal.
process_duration | Process a given batch of code to extract consecutive unique elements and their associated features.
stft | Computes the Fourier transform of short overlapping windows of the input.
Reference
- speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)
Dynamic range compression for audio signals.
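As a rough illustration, dynamic range compression in HiFi-GAN-style vocoders is commonly a clipped logarithm, log(max(x, clip_val) * C). The sketch below assumes that form and works on plain Python floats rather than tensors. `compress` is a hypothetical helper, not the library function.

```python
import math


def compress(values, C=1, clip_val=1e-5):
    """Clipped log compression: log(max(v, clip_val) * C) for each value.

    Assumption: this mirrors the usual HiFi-GAN formulation; the library
    function operates on torch tensors, not Python lists.
    """
    return [math.log(max(v, clip_val) * C) for v in values]


# Values below clip_val are raised to clip_val before the log,
# so 0.0 and 1e-5 compress to the same number.
compressed = compress([0.0, 1e-5, 1.0])
```

Clipping keeps the logarithm finite for silent samples, which would otherwise map to negative infinity.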
- speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)
Calculates the mel spectrogram of a raw audio signal.
- Parameters:
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
compression (bool) – Whether to apply dynamic range compression.
audio (torch.Tensor) – Input audio signal.
- Return type:
Mel spectrogram
- speechbrain.lobes.models.HifiGAN.process_duration(code, code_feat)
Process a given batch of code to extract consecutive unique elements and their associated features.
- Parameters:
code (torch.Tensor (batch, time)) – Tensor of code indices.
code_feat (torch.Tensor (batch, time, channel)) – Tensor of code features.
- Returns:
uniq_code_feat_filtered (torch.Tensor (batch, time)) – Features of consecutive unique codes.
mask (torch.Tensor (batch, time)) – Padding mask for the unique codes.
uniq_code_count (torch.Tensor (n)) – Count of unique codes.
Example
>>> code = torch.IntTensor([[40, 18, 18, 10]])
>>> code_feat = torch.rand([1, 4, 128])
>>> out_tensor, mask, uniq_code = process_duration(code, code_feat)
>>> out_tensor.shape
torch.Size([1, 1, 128])
>>> mask.shape
torch.Size([1, 1])
>>> uniq_code.shape
torch.Size([1])
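Extracting consecutive unique elements amounts to a run-length grouping. The following is a minimal sketch on a plain Python list, with `consecutive_runs` as a hypothetical helper. It does not reproduce how process_duration filters the runs or gathers their features: the example above returns a single entry for an input with three runs, so some filtering happens that is not modeled here.

```python
from itertools import groupby


def consecutive_runs(codes):
    """Return (value, run_length) for each run of equal adjacent codes."""
    return [(value, len(list(group))) for value, group in groupby(codes)]


# Same code sequence as in the example above: 18 repeats twice in a row.
runs = consecutive_runs([40, 18, 18, 10])
```

The run lengths play the role of per-unit durations, which is what a duration predictor is trained to estimate.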
- class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))
Bases: Module
Residual Block Type 1, which has 3 convolutional layers in each convolution block.
- Parameters:
- forward(x)
Returns the output of ResBlock1
- Parameters:
x (torch.Tensor (batch, channel, time)) – input tensor.
- Return type:
The ResBlock outputs
- class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))
Bases: Module
Residual Block Type 2, which has 2 convolutional layers in each convolution block.
- Parameters:
- forward(x)
Returns the output of ResBlock2
- Parameters:
x (torch.Tensor (batch, channel, time)) – input tensor.
- Return type:
The ResBlock outputs
- class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)
Bases: Module
HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)
- Parameters:
in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. "1" or "2".
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
cond_channels (int) – If provided, adds a conv layer to the beginning of the forward.
conv_post_bias (bool) – Whether to add a bias term to the final conv.
Example
>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator = HifiganGenerator(
...     in_channels=80,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[16, 16, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
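The output length in the example follows from the upsampling factors: 33 input frames times 8 * 8 * 2 * 2 = 256 gives 8448 samples. The sketch below reproduces that arithmetic and the channel-halving rule described for upsample_initial_channel. It assumes each transposed convolution scales the time axis by exactly its stride, and both helper names are hypothetical.

```python
from math import prod


def output_length(frames, upsample_factors):
    """Number of waveform samples for a given number of input frames.

    Assumption: every upsampling layer multiplies the length by its stride.
    """
    return frames * prod(upsample_factors)


def channel_schedule(initial_channel, n_layers):
    """Channel count per upsampling layer, halved at each consecutive layer."""
    return [initial_channel // (2 ** i) for i in range(n_layers)]


samples = output_length(33, [8, 8, 2, 2])
channels = channel_schedule(512, 4)
```

The product of the upsampling factors is also the hop size the generator implies, so it normally has to match the hop length used to compute the input features.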
- forward(x, g=None)
- Parameters:
x (torch.Tensor (batch, channel, time)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.
- Return type:
The generator outputs
- inference(c, padding=True)
The inference function performs a padding and runs the forward method.
- Parameters:
c (torch.Tensor (batch, channel, time)) – feature input tensor.
padding (bool) – Whether to pad tensor before forward.
- Return type:
The generator outputs
- class speechbrain.lobes.models.HifiGAN.VariancePredictor(encoder_embed_dim, var_pred_hidden_dim, var_pred_kernel_size, var_pred_dropout)
Bases: Module
Variance predictor inspired by FastSpeech2
- Parameters:
Example
>>> inp_tensor = torch.rand([4, 80, 128])
>>> duration_predictor = VariancePredictor(
...     encoder_embed_dim=128,
...     var_pred_hidden_dim=128,
...     var_pred_kernel_size=3,
...     var_pred_dropout=0.5,
... )
>>> out_tensor = duration_predictor(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 80])
- forward(x)
- Parameters:
x (torch.Tensor (batch, channel, time)) – feature input tensor.
- Return type:
Variance predictor output
- class speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True, vocab_size=100, embedding_dim=128, attn_dim=128, duration_predictor=False, var_pred_hidden_dim=128, var_pred_kernel_size=3, var_pred_dropout=0.5, multi_speaker=False, normalize_speaker_embeddings=False, skip_token_embedding=False, pooling_type='attention')
Bases: HifiganGenerator
The UnitHiFiGAN generator takes discrete speech tokens as input. The generator is adapted to support bitrate scalability training. For more details, refer to: https://arxiv.org/abs/2406.10735.
- Parameters:
in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. "1" or "2".
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
cond_channels (int) – If provided, adds a conv layer to the beginning of the forward.
conv_post_bias (bool) – Whether to add a bias term to the final conv.
vocab_size (int) – size of the dictionary of embeddings.
embedding_dim (int) – size of each embedding vector.
attn_dim (int) – size of attention dimension.
duration_predictor (bool) – enable duration predictor module.
var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers of the duration predictor.
var_pred_kernel_size (int) – size of the convolution filter in each layer of the duration predictor.
var_pred_dropout (float) – dropout probability of each layer in the duration predictor.
multi_speaker (bool) – enable multi speaker training.
normalize_speaker_embeddings (bool) – enable normalization of speaker embeddings.
skip_token_embedding (bool) – Whether to skip the embedding layer in the case of continuous input.
pooling_type (str, optional) – The type of pooling to use. Must be one of ["attention", "sum", "none"]. Defaults to "attention" for scalable vocoder.
Example
>>> inp_tensor = torch.randint(0, 100, (4, 10, 1))
>>> unit_hifigan_generator = UnitHifiganGenerator(
...     in_channels=128,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[11, 8, 8, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[5, 4, 4, 2, 2],
...     vocab_size=100,
...     embedding_dim=128,
...     duration_predictor=True,
...     var_pred_hidden_dim=128,
...     var_pred_kernel_size=3,
...     var_pred_dropout=0.5,
... )
>>> out_tensor, _ = unit_hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 3200])
- forward(x, g=None, spk=None)
- Parameters:
x (torch.Tensor (batch, time, channel)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.
spk (torch.Tensor) – Speaker embeddings
- Return type:
Generator output
- inference(x, spk=None)
The inference function performs duration prediction and runs the forward method.
- Parameters:
x (torch.Tensor (batch, time, channel)) – feature input tensor.
spk (torch.Tensor) – Speaker embeddings
- Return type:
Generator output
- class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)
Bases: Module
HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions.
Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs -> score, feat
- Parameters:
- forward(x)
- Parameters:
x (torch.Tensor (batch, 1, time)) – input waveform.
- Return type:
Scores and features
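The "every Pth value" note can be illustrated with plain list slicing. The sketch below also shows the folding of a 1D waveform into rows of `period` samples, which is how the HiFi-GAN paper describes the periodic view. Both helpers are hypothetical, and dropping the trailing remainder is an assumption of this sketch, not necessarily what the library does with incomplete rows.

```python
def every_pth(waveform, period):
    """Keep samples at indices 0, period, 2 * period, ..."""
    return waveform[::period]


def fold_by_period(waveform, period):
    """Split the waveform into consecutive rows of `period` samples.

    Assumption: a trailing remainder shorter than `period` is dropped here.
    Column 0 of the result matches every_pth() up to that truncation.
    """
    usable = len(waveform) - len(waveform) % period
    return [waveform[i:i + period] for i in range(0, usable, period)]


wave = [1, 2, 3, 4, 5, 6]
subsampled = every_pth(wave, 2)
folded = fold_by_period(wave, 2)
```

Reading the folded rows column by column gives one equally spaced subsequence per phase offset, so 2D convolutions over that layout see periodic structure directly.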
- class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator
Bases: Module
HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the PeriodDiscriminator to apply it in different periods. Periods are suggested to be prime numbers to reduce the overlap between each discriminator.
- forward(x)
Returns Multi-Period Discriminator scores and features
- Parameters:
x (torch.Tensor (batch, 1, time)) – input waveform.
- Return type:
Scores and features
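Why prime periods reduce overlap can be checked with index arithmetic. The sketch below compares which sample positions two period-based subsamplings share. It models only the "every Pth value" view, and the helper names are hypothetical.

```python
def sampled_indices(period, length):
    """Sample positions visited when taking every `period`-th value."""
    return set(range(0, length, period))


def shared_fraction(p, q, length=1000):
    """Share of the sparser subsampling's positions also seen by the other."""
    a = sampled_indices(p, length)
    b = sampled_indices(q, length)
    return len(a & b) / min(len(a), len(b))


redundant = shared_fraction(2, 4)  # period 4 only sees positions period 2 sees
coprime = shared_fraction(2, 3)    # positions coincide only at multiples of 6
```

With periods 2 and 4 the sparser view is fully contained in the denser one, while coprime periods coincide only at multiples of their product.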
- class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)
Bases: Module
HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.
- Parameters:
use_spectral_norm (bool) – if True, switch to spectral norm instead of weight norm.
- forward(x)
- Parameters:
x (torch.Tensor (batch, 1, time)) – input waveform.
- Return type:
Scores and features
- class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator
Bases: Module
HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.
- forward(x)
- Parameters:
x (torch.Tensor (batch, 1, time)) – input waveform.
- Return type:
Scores and features
- class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator
Bases: Module
HiFiGAN discriminator wrapping MPD and MSD.
Example
>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator = HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8
- forward(x)
Returns a list of lists of features from each layer of each discriminator.
- Parameters:
x (torch.Tensor) – input waveform.
- Return type:
Features from each discriminator layer
- speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')
Computes the Fourier transform of short overlapping windows of the input.
- class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)
Bases: Module
STFT loss. Generated and real input waveforms are converted to spectrograms and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf
- Parameters:
- class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))
Bases: Module
Multi-scale STFT loss. Generated and real input waveforms are converted to spectrograms and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf
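The two spectrogram terms can be sketched as follows, using the definitions from the ParallelWaveGAN paper: spectral convergence is the Frobenius norm of the magnitude difference divided by that of the target, and the L1 term is the mean absolute difference of log magnitudes. The code works on nested lists of positive magnitudes instead of tensors. The helper names and the plain averaging over resolutions are assumptions of this sketch, not the library's exact reductions.

```python
import math


def spectral_convergence(mag_hat, mag):
    """Frobenius norm of (mag - mag_hat) divided by Frobenius norm of mag."""
    diff = sum((b - a) ** 2
               for row_hat, row in zip(mag_hat, mag)
               for a, b in zip(row_hat, row))
    ref = sum(b ** 2 for row in mag for b in row)
    return math.sqrt(diff) / math.sqrt(ref)


def log_magnitude_l1(mag_hat, mag):
    """Mean absolute difference between log magnitudes."""
    terms = [abs(math.log(b) - math.log(a))
             for row_hat, row in zip(mag_hat, mag)
             for a, b in zip(row_hat, row)]
    return sum(terms) / len(terms)


def multi_scale(loss_fn, spectrogram_pairs):
    """Average a single-resolution loss over several (generated, real) pairs."""
    values = [loss_fn(mag_hat, mag) for mag_hat, mag in spectrogram_pairs]
    return sum(values) / len(values)


real = [[1.0, 2.0], [2.0, 4.0]]
fake = [[1.0, 2.0], [2.0, 1.0]]
sc = spectral_convergence(fake, real)
```

Each entry of n_ffts, hop_lengths and win_lengths defines one resolution. Combining several resolutions trades off time and frequency detail across the individual terms.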
- class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)
Bases: Module
L1 Loss over Spectrograms as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf
Note: L1 loss helps learning details compared with L2 loss.
- Parameters:
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_mel_channels (int) – Number of mel filterbanks.
n_fft (int) – Size of FFT.
n_stft (int) – Size of STFT.
mel_fmin (float) – Minimum frequency.
mel_fmax (float) – Maximum frequency.
mel_normalized (bool) – Whether to normalize by magnitude after STFT.
power (float) – Exponent for the magnitude spectrogram.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: "htk" or "slaney".
dynamic_range_compression (bool) – Whether to apply dynamic range compression.
- class speechbrain.lobes.models.HifiGAN.MSEGLoss(*args, **kwargs)
Bases: Module
Mean Squared Generator Loss. The generator is trained to fool the discriminator, so that generated samples are classified as close to 1 as possible.
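A least-squares generator objective of this kind penalizes the squared distance between the discriminator's score for a generated sample and the target value 1. The following is a minimal sketch on a flat list of scores. `mse_generator_loss` is a hypothetical helper, and the library may reduce over batches and discriminators differently.

```python
def mse_generator_loss(scores_fake):
    """Mean of (score - 1)^2 over discriminator scores of generated audio."""
    return sum((score - 1.0) ** 2 for score in scores_fake) / len(scores_fake)


fooled = mse_generator_loss([1.0, 1.0])  # discriminator scores fakes as real
caught = mse_generator_loss([0.0, 0.0])  # discriminator scores fakes as fake
```

The loss is zero once every generated sample is scored exactly 1 and grows quadratically as scores move away from it.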
- class speechbrain.lobes.models.HifiGAN.HingeGLoss(*args, **kwargs)
Bases: Module
Hinge Generator Loss.
The generator is trained to fool the discriminator, so that generated samples are classified as close to 1 as possible.
Example
>>> import torch
>>> score_fake = torch.randn(4, 88)
>>> loss = HingeGLoss()(score_fake)
>>> print(loss)
- forward(score_fake)
Returns Generator GAN loss
- Parameters:
score_fake (torch.Tensor) – Discriminator scores of generated waveforms D(G(s))
- Return type:
Generator loss
- class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss
Bases: Module
Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
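Feature matching can be sketched as an L1 distance between discriminator activations for real and generated audio, accumulated over every layer of every discriminator. The version below treats each layer's features as a flat list of floats and averages the per-layer means. The helper name and this particular averaging are assumptions, not the library's exact normalization.

```python
def feature_matching_loss(feats_fake, feats_real):
    """Average over discriminators and layers of the mean |fake - real|.

    feats_fake / feats_real: one list per discriminator, each holding one
    flat list of activations per layer (a simplification of real tensors).
    """
    layer_losses = []
    for disc_fake, disc_real in zip(feats_fake, feats_real):
        for layer_fake, layer_real in zip(disc_fake, disc_real):
            diffs = [abs(f - r) for f, r in zip(layer_fake, layer_real)]
            layer_losses.append(sum(diffs) / len(diffs))
    return sum(layer_losses) / len(layer_losses)


# One discriminator with two layers; only the first layer differs.
real_feats = [[[0.0, 1.0], [2.0]]]
fake_feats = [[[0.0, 3.0], [2.0]]]
fm_loss = feature_matching_loss(fake_feats, real_feats)
```

Because the targets are the discriminator's own intermediate activations, the metric adapts as the discriminator trains, which is why it is described as a learned similarity metric.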
- class speechbrain.lobes.models.HifiGAN.MSEDLoss
Bases: Module
Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.
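The matching least-squares discriminator objective pushes scores of real audio toward 1 and scores of generated audio toward 0. The following is a minimal sketch on flat lists of scores. `mse_discriminator_loss` is a hypothetical helper, and returning the two parts alongside the total is a choice made here for illustration.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Return (total, real_part, fake_part) of the least-squares objective."""
    real_part = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    fake_part = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return real_part + fake_part, real_part, fake_part


# A perfect discriminator: real audio scored 1, generated audio scored 0.
perfect = mse_discriminator_loss(scores_fake=[0.0, 0.0], scores_real=[1.0, 1.0])
# A fully fooled discriminator: the scores are swapped.
fooled = mse_discriminator_loss(scores_fake=[1.0, 1.0], scores_real=[0.0, 0.0])
```

Keeping the two parts separate makes it easier to see whether the discriminator is failing on real or on generated samples.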
- class speechbrain.lobes.models.HifiGAN.HingeDLoss(*args, **kwargs)
Bases: Module
Hinge Discriminator Loss.
The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.
Example
>>> import torch
>>> score_fake = torch.randn(4, 88)
>>> score_real = torch.randn(4, 88)
>>> loss = HingeDLoss()(score_fake, score_real)
>>> print(loss)
- forward(score_fake, score_real)
Returns Discriminator GAN losses
- Parameters:
score_fake (torch.Tensor) – discriminator scores of generated waveforms
score_real (torch.Tensor) – discriminator scores of groundtruth waveforms
- Return type:
Discriminator losses
- class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0, mseg_dur_loss=None, mseg_dur_loss_weight=0)
Bases: Module
Creates a summary of generator losses and applies weights for different losses
- Parameters:
stft_loss (object) – object of stft loss
stft_loss_weight (float) – weight of STFT loss
mseg_loss (object) – object of mseg loss
mseg_loss_weight (float) – weight of mseg loss
feat_match_loss (object) – object of feature match loss
feat_match_loss_weight (float) – weight of feature match loss
l1_spec_loss (object) – object of L1 spectrogram loss
l1_spec_loss_weight (float) – weight of L1 spectrogram loss
mseg_dur_loss (object) – object of mseg duration loss
mseg_dur_loss_weight (float) – weight of mseg duration loss
- forward(stage, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, log_dur_pred=None, log_dur=None)
Returns a dictionary of generator losses and applies weights
- Parameters:
stage (speechbrain.Stage) – training, validation or testing
y_hat (torch.Tensor) – generated waveform tensor
y (torch.Tensor) – real waveform tensor
scores_fake (list) – discriminator scores of generated waveforms
feats_fake (list) – discriminator features of generated waveforms
feats_real (list) – discriminator features of groundtruth waveforms
log_dur_pred (torch.Tensor) – Predicted duration for duration loss
log_dur (torch.Tensor) – Real duration for duration loss
- Return type:
Dictionary of generator losses
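The weighting step can be pictured as scaling each individual loss by its weight and summing the results into one training objective. The sketch below does this for already-computed scalar losses. The helper name and the dictionary keys, including "total", are hypothetical and do not reflect the keys the library returns.

```python
def combine_generator_losses(losses, weights):
    """Scale each named loss by its weight and add a weighted total.

    `losses` and `weights` are dicts keyed by hypothetical loss names;
    a loss without a weight contributes 0, matching the default weights
    of 0 in the signature above.
    """
    weighted = {name: weights.get(name, 0) * value
                for name, value in losses.items()}
    weighted["total"] = sum(weighted.values())
    return weighted


summary = combine_generator_losses(
    losses={"stft": 0.8, "mseg": 0.5, "feat_match": 0.2},
    weights={"stft": 1, "mseg": 1, "feat_match": 10},
)
```

Returning the individual weighted terms next to the total keeps each component available for logging while only the total is backpropagated.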