speechbrain.lobes.models.HifiGAN module

Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

For more details: https://arxiv.org/pdf/2010.05646.pdf, https://arxiv.org/abs/2406.10735

Authors
  • Jarod Duret 2021

  • Yingzhi WANG 2022

Summary

Classes:

DiscriminatorLoss

Creates a summary of discriminator losses

DiscriminatorP

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.

DiscriminatorS

HiFiGAN Scale Discriminator.

GeneratorLoss

Creates a summary of generator losses and applies weights for different losses

HifiganDiscriminator

HiFiGAN discriminator wrapping MPD and MSD.

HifiganGenerator

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

HingeDLoss

Hinge Discriminator Loss.

HingeGLoss

Hinge Generator Loss.

L1SpecLoss

L1 Loss over Spectrograms, as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps with learning details compared with L2 loss.

MSEDLoss

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.

MSEGLoss

Mean Squared Generator Loss. The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.

MelganFeatureLoss

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

MultiPeriodDiscriminator

HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the PeriodDiscriminator to apply it in different periods.

MultiScaleDiscriminator

HiFiGAN Multi-Scale Discriminator.

MultiScaleSTFTLoss

Multi-scale STFT loss.

ResBlock1

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

ResBlock2

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

STFTLoss

STFT loss.

UnitHifiganGenerator

The UnitHiFiGAN generator takes discrete speech tokens as input.

VariancePredictor

Variance predictor inspired from FastSpeech2

Functions:

dynamic_range_compression

Dynamic range compression for audio signals

mel_spectogram

calculates MelSpectrogram for a raw audio signal

process_duration

Process a given batch of code to extract consecutive unique elements and their associated features.

stft

computes the Fourier transform of short overlapping windows of the input

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals
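
As a rough illustration, dynamic range compression of this kind is commonly written as a clipped logarithm, log(max(x, clip_val) * C). The plain-Python sketch below assumes that formulation; it is not taken from the SpeechBrain source, and the helper name `compress` is hypothetical.

```python
import math

def compress(x, C=1, clip_val=1e-5):
    """Clipped-log compression of one non-negative magnitude (assumed formula)."""
    # Clip small values so the logarithm stays finite, then scale by C.
    return math.log(max(x, clip_val) * C)

print(compress(1.0))  # 0.0
print(compress(0.0))  # equals math.log(1e-5), since zero is clipped first
```

Clipping matters because silent frames have magnitudes at or near zero, where an unclipped logarithm would diverge.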

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

calculates MelSpectrogram for a raw audio signal

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: "htk" or "slaney".

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.tensor) – input audio signal

Return type:

Mel spectrogram

speechbrain.lobes.models.HifiGAN.process_duration(code, code_feat)[source]

Process a given batch of code to extract consecutive unique elements and their associated features.

Parameters:
  • code (torch.Tensor (batch, time)) – Tensor of code indices.

  • code_feat (torch.Tensor (batch, time, channel)) – Tensor of code features.

Returns:

  • uniq_code_feat_filtered (torch.Tensor (batch, time, channel)) – Features of consecutive unique codes.

  • mask (torch.Tensor (batch, time)) – Padding mask for the unique codes.

  • uniq_code_count (torch.Tensor (n)) – Count of unique codes.

Example

>>> code = torch.IntTensor([[40, 18, 18, 10]])
>>> code_feat = torch.rand([1, 4, 128])
>>> out_tensor, mask, uniq_code = process_duration(code, code_feat)
>>> out_tensor.shape
torch.Size([1, 1, 128])
>>> mask.shape
torch.Size([1, 1])
>>> uniq_code.shape
torch.Size([1])
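
The idea of "consecutive unique elements" can be sketched in plain Python as run-length encoding. The helper name `run_lengths` is hypothetical, and the trimming of the first and last runs is an assumption made here because it would be consistent with the single unit reported in the example above; it is not stated in this documentation.

```python
from itertools import groupby

def run_lengths(code):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(code)]

runs = run_lengths([40, 18, 18, 10])
print(runs)  # [(40, 1), (18, 2), (10, 1)]

# Assumption: edge runs are dropped when more than two runs exist, since a
# randomly cropped segment may cut the first and last units short.
inner = runs[1:-1] if len(runs) > 2 else runs
print(inner)  # [(18, 2)]
```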
class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

Parameters:
  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (list) – list of dilation value for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock1

Parameters:

x (torch.Tensor (batch, channel, time)) – input tensor.

Return type:

The ResBlock outputs

remove_weight_norm()[source]

This function removes weight normalization during inference.

class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

Parameters:
  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (list) – list of dilation value for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock2

Parameters:

x (torch.Tensor (batch, channel, time)) – input tensor.

Return type:

The ResBlock outputs

remove_weight_norm()[source]

This function removes weight normalization during inference.

class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters:
  • in_channels (int) – number of input tensor channels.

  • out_channels (int) – number of output tensor channels.

  • resblock_type (str) – type of the ResBlock. "1" or "2".

  • resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.

  • resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.

  • upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.

  • upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.

  • upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.

  • inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.

  • cond_channels (int) – If provided, adds a conv layer to the beginning of the forward.

  • conv_post_bias (bool) – Whether to add a bias term to the final conv.

Example

>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator = HifiganGenerator(
...     in_channels=80,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[16, 16, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
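
The output length in the example above can be checked by hand. The sketch below assumes each upsampling layer scales the time axis by exactly its stride, which agrees with the shapes shown but is not stated explicitly in this documentation.

```python
import math

upsample_factors = [8, 8, 2, 2]    # strides from the example above
n_frames = 33                      # time dimension of inp_tensor

hop = math.prod(upsample_factors)  # output samples produced per input frame
print(hop)             # 256
print(n_frames * hop)  # 8448
```

The same arithmetic matches the UnitHifiganGenerator example: factors [5, 4, 4, 2, 2] give 320 samples per frame, so 10 frames yield 3200 samples.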
forward(x, g=None)[source]
Parameters:
  • x (torch.Tensor (batch, channel, time)) – feature input tensor.

  • g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

Return type:

The generator outputs

remove_weight_norm()[source]

This function removes weight normalization during inference.

inference(c, padding=True)[source]

The inference function performs a padding and runs the forward method.

Parameters:
  • c (torch.Tensor (batch, channel, time)) – feature input tensor.

  • padding (bool) – Whether to pad tensor before forward.

Return type:

The generator outputs

class speechbrain.lobes.models.HifiGAN.VariancePredictor(encoder_embed_dim, var_pred_hidden_dim, var_pred_kernel_size, var_pred_dropout)[source]

Bases: Module

Variance predictor inspired from FastSpeech2

Parameters:
  • encoder_embed_dim (int) – number of input tensor channels.

  • var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers.

  • var_pred_kernel_size (int) – size of the convolution filter in each layer.

  • var_pred_dropout (float) – dropout probability of each layer.

Example

>>> inp_tensor = torch.rand([4, 80, 128])
>>> duration_predictor = VariancePredictor(
...     encoder_embed_dim=128,
...     var_pred_hidden_dim=128,
...     var_pred_kernel_size=3,
...     var_pred_dropout=0.5,
... )
>>> out_tensor = duration_predictor(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 80])
forward(x)[source]
Parameters:

x (torch.Tensor (batch, channel, time)) – feature input tensor.

Return type:

Variance predictor output

class speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True, vocab_size=100, embedding_dim=128, attn_dim=128, duration_predictor=False, var_pred_hidden_dim=128, var_pred_kernel_size=3, var_pred_dropout=0.5, multi_speaker=False, normalize_speaker_embeddings=False, skip_token_embedding=False, pooling_type='attention')[source]

Bases: HifiganGenerator

The UnitHiFiGAN generator takes discrete speech tokens as input. The generator is adapted to support bitrate scalability training. For more details, refer to: https://arxiv.org/abs/2406.10735.

Parameters:
  • in_channels (int) – number of input tensor channels.

  • out_channels (int) – number of output tensor channels.

  • resblock_type (str) – type of the ResBlock. "1" or "2".

  • resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.

  • resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.

  • upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.

  • upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.

  • upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.

  • inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.

  • cond_channels (int) – Whether to add a conv to the front

  • conv_post_bias (bool) – Whether to add a bias to the last conv

  • vocab_size (int) – size of the dictionary of embeddings.

  • embedding_dim (int) – size of each embedding vector.

  • attn_dim (int) – size of attention dimension.

  • duration_predictor (bool) – enable duration predictor module.

  • var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers of the duration predictor.

  • var_pred_kernel_size (int) – size of the convolution filter in each layer of the duration predictor.

  • var_pred_dropout (float) – dropout probability of each layer in the duration predictor.

  • multi_speaker (bool) – enable multi speaker training.

  • normalize_speaker_embeddings (bool) – enable normalization of speaker embeddings.

  • skip_token_embedding (bool) – Whether to skip the embedding layer in the case of continuous input.

  • pooling_type (str, optional) – The type of pooling to use. Must be one of ["attention", "sum", "none"]. Defaults to "attention" for scalable vocoder.

Example

>>> inp_tensor = torch.randint(0, 100, (4, 10, 1))
>>> unit_hifigan_generator = UnitHifiganGenerator(
...     in_channels=128,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[11, 8, 8, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[5, 4, 4, 2, 2],
...     vocab_size=100,
...     embedding_dim=128,
...     duration_predictor=True,
...     var_pred_hidden_dim=128,
...     var_pred_kernel_size=3,
...     var_pred_dropout=0.5,
... )
>>> out_tensor, _ = unit_hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 3200])
forward(x, g=None, spk=None)[source]
Parameters:
  • x (torch.Tensor (batch, time, channel)) – feature input tensor.

  • g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

  • spk (torch.Tensor) – Speaker embeddings

Return type:

Generator output

inference(x, spk=None)[source]

The inference function performs duration prediction and runs the forward method.

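Parameters:
  • x (torch.Tensor (batch, time, channel)) – feature input tensor.

  • spk (torch.Tensor) – Speaker embeddings

Return type: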

Generator output
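
To make "duration prediction" concrete, the sketch below shows one way predicted log-durations could be turned into repeated frames before vocoding. The decoding rule (durations predicted as log(1 + d), rounded, with a minimum of one frame) and the helper name `expand_by_duration` are assumptions for illustration, not a description of the SpeechBrain implementation.

```python
import math

def expand_by_duration(frames, log_durations):
    """Repeat each frame by its predicted duration (assumed decoding rule)."""
    out = []
    for frame, log_dur in zip(frames, log_durations):
        # Assumption: invert log(1 + d), round to an integer, keep at least 1.
        dur = max(1, round(math.exp(log_dur) - 1))
        out.extend([frame] * dur)
    return out

expanded = expand_by_duration(
    ["a", "b", "c"], [math.log(3), math.log(2), math.log(1.2)]
)
print(expanded)  # ['a', 'a', 'b', 'c']
```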

class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions.

Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat

Parameters:
  • period (int) – Interval, in samples, at which values are taken from the waveform.

  • kernel_size (int) – Size of 1-d kernel for conv stack

  • stride (int) – Stride of conv stack
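
The "every Pth value" selection can be pictured as folding the waveform into rows of length `period`, so that each column holds samples spaced `period` apart. The plain-Python sketch below illustrates this; the helper name `fold_by_period` is hypothetical and zero padding of the tail is an assumption chosen for simplicity.

```python
def fold_by_period(wave, period):
    """Fold a 1-D waveform into rows of length `period`, padding the tail."""
    # Assumption: pad so the length is a multiple of `period`; zeros are
    # used here purely for illustration.
    pad = (-len(wave)) % period
    padded = list(wave) + [0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]

rows = fold_by_period([1, 2, 3, 4, 5, 6], period=2)
print(rows)                      # [[1, 2], [3, 4], [5, 6]]
print([row[0] for row in rows])  # [1, 3, 5]
```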

forward(x)[source]
Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the PeriodDiscriminator to apply it in different periods. Periods are suggested to be prime numbers to reduce the overlap between each discriminator.

forward(x)[source]

Returns Multi-Period Discriminator scores and features

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.

Parameters:

use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.

forward(x)[source]
Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.

forward(x)[source]
Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator = HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8
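
The count of 8 in the example above is consistent with five period discriminators plus three scale discriminators, as in the HiFi-GAN paper. That this module uses exactly these settings is an assumption in the sketch below.

```python
# Assumption: periods and scale count follow the HiFi-GAN paper's configuration.
mpd_periods = (2, 3, 5, 7, 11)  # one DiscriminatorP per (prime) period
msd_scales = 3                  # number of DiscriminatorS instances

print(len(mpd_periods) + msd_scales)  # 8
```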
forward(x)[source]

Returns list of list of features from each layer of each discriminator.

Parameters:

x (torch.Tensor) – input waveform.

Return type:

Features from each discriminator layer

speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]

computes the Fourier transform of short overlapping windows of the input

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. The generated and real input waveforms are converted to spectrograms, which are compared using L1 and spectral convergence losses. It is taken from the ParallelWaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf

Parameters:
  • n_fft (int) – size of Fourier transform.

  • hop_length (int) – the distance between neighboring sliding window frames.

  • win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

Parameters:
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

Return type:

Magnitude loss and spectral convergence loss
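
The two terms can be sketched on precomputed magnitude spectrograms, following the definitions in the ParallelWaveGAN paper: an L1 distance between log-magnitudes, and the Frobenius norm of the error relative to the reference. This plain-Python version operates on nested lists of strictly positive values; the helper name `stft_losses` is hypothetical and the exact reductions used by the module are not confirmed here.

```python
import math

def stft_losses(mag_hat, mag):
    """Return (log-magnitude L1, spectral convergence) for two spectrograms."""
    flat_hat = [v for row in mag_hat for v in row]
    flat = [v for row in mag for v in row]
    # Magnitude loss: mean absolute difference of log-magnitudes.
    loss_mag = sum(
        abs(math.log(a) - math.log(b)) for a, b in zip(flat, flat_hat)
    ) / len(flat)
    # Spectral convergence: Frobenius norm of the error over that of the reference.
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(flat, flat_hat)))
    loss_sc = err / math.sqrt(sum(a ** 2 for a in flat))
    return loss_mag, loss_sc

loss_mag, loss_sc = stft_losses([[6.0, 8.0]], [[3.0, 4.0]])
print(loss_sc)  # 1.0
```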

class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. The generated and real input waveforms are converted to spectrograms at several resolutions, which are compared using L1 and spectral convergence losses. It is taken from the ParallelWaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

Parameters:
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

Return type:

Magnitude loss and spectral convergence loss
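
The multi-scale variant pairs up the default tuples element-wise, giving three (n_fft, hop_length, win_length) resolutions. The sketch below averages a per-resolution loss over them; averaging (rather than summing) and the helper name `multi_scale` are assumptions, and the loss passed in is a stand-in.

```python
def multi_scale(loss_fn, y_hat, y,
                n_ffts=(1024, 2048, 512),
                hop_lengths=(120, 240, 50),
                win_lengths=(600, 1200, 240)):
    """Average a per-resolution loss over all (n_fft, hop, win) triplets."""
    values = [
        loss_fn(y_hat, y, n_fft, hop, win)
        for n_fft, hop, win in zip(n_ffts, hop_lengths, win_lengths)
    ]
    return sum(values) / len(values)

# Hypothetical stand-in loss that simply reports the FFT size it was given.
avg_n_fft = multi_scale(lambda y_hat, y, n_fft, hop, win: n_fft, None, None)
print(avg_n_fft)
```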

class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 Loss over Spectrograms, as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps with learning details compared with L2 loss.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_mel_channels (int) – Number of mel filterbanks.

  • n_fft (int) – Size of FFT.

  • n_stft (int) – Size of STFT.

  • mel_fmin (float) – Minimum frequency.

  • mel_fmax (float) – Maximum frequency.

  • mel_normalized (bool) – Whether to normalize by magnitude after stft.

  • power (float) – Exponent for the magnitude spectrogram.

  • norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: "htk" or "slaney".

  • dynamic_range_compression (bool) – whether to do dynamic range compression
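
The default values of n_fft and n_stft are consistent with a one-sided STFT, which keeps n_fft // 2 + 1 frequency bins. The relationship is not stated in this documentation, so treat the check below as an assumption about how the two defaults relate.

```python
n_fft = 1024             # default FFT size from the signature above
n_stft = n_fft // 2 + 1  # number of non-redundant frequency bins

print(n_stft)  # 513
```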

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

Parameters:
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

Return type:

L1 loss

class speechbrain.lobes.models.HifiGAN.MSEGLoss(*args, **kwargs)[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.

forward(score_fake)[source]

Returns Generator GAN loss

Parameters:

score_fake (list) – discriminator scores of generated waveforms D(G(s))

Return type:

Generator loss
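
A minimal sketch of the idea, assuming the least-squares GAN objective in which generated samples are pushed toward the target score 1. It works on a flat list of floats for a single discriminator; the helper name `mse_generator_loss` is hypothetical.

```python
def mse_generator_loss(scores_fake):
    """Mean squared distance of D(G(s)) from the assumed 'real' target 1."""
    return sum((s - 1.0) ** 2 for s in scores_fake) / len(scores_fake)

print(mse_generator_loss([1.0, 1.0]))  # 0.0
print(mse_generator_loss([0.0, 1.0]))  # 0.5
```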

class speechbrain.lobes.models.HifiGAN.HingeGLoss(*args, **kwargs)[source]

Bases: Module

Hinge Generator Loss.

The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.

Example

>>> import torch
>>> score_fake = torch.randn(4, 88)
>>> loss = HingeGLoss()(score_fake)
>>> print(loss)

forward(score_fake)[source]

Returns Generator GAN loss

Parameters:

score_fake (torch.Tensor) – Discriminator scores of generated waveforms D(G(s))

Return type:

Generator loss
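
One formulation consistent with the description above penalizes generated scores only while they fall below 1, that is mean(max(0, 1 - D(G(s)))). Whether the module uses exactly this form is an assumption; the helper name `hinge_generator_loss` is hypothetical and the input is a flat list of floats.

```python
def hinge_generator_loss(scores_fake):
    """Hinge penalty on generated scores that fall below 1 (assumed form)."""
    return sum(max(0.0, 1.0 - s) for s in scores_fake) / len(scores_fake)

print(hinge_generator_loss([2.0, 0.0]))  # 0.5
```

Scores at or above 1 contribute nothing, so the generator is not rewarded for pushing them further.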

class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns feature matching loss

Parameters:
  • fake_feats (list) – discriminator features of generated waveforms

  • real_feats (list) – discriminator features of groundtruth waveforms

Return type:

Feature matching loss
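
Feature matching can be sketched as an L1 distance between corresponding discriminator activations for real and generated audio. In this plain-Python version each discriminator contributes a list of per-layer feature vectors; dividing by the number of discriminators is an assumption, and the helper name `feature_matching_loss` is hypothetical.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Sum per-layer mean L1 distances, averaged over discriminators."""
    total = 0.0
    for disc_fake, disc_real in zip(fake_feats, real_feats):
        for f, r in zip(disc_fake, disc_real):
            # Mean absolute difference for one layer's features.
            total += sum(abs(a - b) for a, b in zip(f, r)) / len(f)
    # Assumption: normalize by the number of discriminators.
    return total / len(fake_feats)

fake = [[[0.0, 2.0], [1.0]]]  # one discriminator, two layers
real = [[[1.0, 1.0], [1.0]]]
print(feature_matching_loss(fake, real))  # 1.0
```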

class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters:
  • score_fake (list) – discriminator scores of generated waveforms

  • score_real (list) – discriminator scores of groundtruth waveforms

Return type:

Discriminator losses
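
Following the description above, a least-squares discriminator loss has one term pulling real scores toward 1 and one pulling generated scores toward 0. The sketch below uses flat lists of floats; the returned tuple layout and the helper name `mse_discriminator_loss` are assumptions for illustration.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Return (total, real term, fake term) of a least-squares D loss."""
    loss_real = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    loss_fake = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake, loss_real, loss_fake

print(mse_discriminator_loss([0.0, 0.0], [1.0, 1.0]))  # (0.0, 0.0, 0.0)
print(mse_discriminator_loss([1.0], [0.0]))            # (2.0, 1.0, 1.0)
```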

class speechbrain.lobes.models.HifiGAN.HingeDLoss(*args, **kwargs)[source]

Bases: Module

Hinge Discriminator Loss.

The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.

Example

>>> import torch
>>> score_fake = torch.randn(4, 88)
>>> score_real = torch.randn(4, 88)
>>> loss = HingeDLoss()(score_fake, score_real)
>>> print(loss)

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters:
  • score_fake (torch.Tensor) – discriminator scores of generated waveforms

  • score_real (torch.Tensor) – discriminator scores of groundtruth waveforms

Return type:

Discriminator losses
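
The standard hinge formulation for a discriminator penalizes real scores below 1 and generated scores above -1. The sketch below assumes that standard form, which this documentation does not spell out; the helper name `hinge_discriminator_loss` is hypothetical.

```python
def hinge_discriminator_loss(scores_fake, scores_real):
    """Return (total, real term, fake term) of a standard hinge D loss."""
    loss_real = sum(max(0.0, 1.0 - s) for s in scores_real) / len(scores_real)
    loss_fake = sum(max(0.0, 1.0 + s) for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake, loss_real, loss_fake

print(hinge_discriminator_loss([-1.0, -3.0], [1.0, 2.0]))  # (0.0, 0.0, 0.0)
print(hinge_discriminator_loss([0.0], [0.0]))              # (2.0, 1.0, 1.0)
```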

class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0, mseg_dur_loss=None, mseg_dur_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

Parameters:
  • stft_loss (object) – object of stft loss

  • stft_loss_weight (float) – weight of STFT loss

  • mseg_loss (object) – object of mseg loss

  • mseg_loss_weight (float) – weight of mseg loss

  • feat_match_loss (object) – object of feature match loss

  • feat_match_loss_weight (float) – weight of feature match loss

  • l1_spec_loss (object) – object of L1 spectrogram loss

  • l1_spec_loss_weight (float) – weight of L1 spectrogram loss

  • mseg_dur_loss (object) – object of mseg duration loss

  • mseg_dur_loss_weight (float) – weight of mseg duration loss

forward(stage, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, log_dur_pred=None, log_dur=None)[source]

Returns a dictionary of generator losses and applies weights

Parameters:
  • stage (speechbrain.Stage) – training, validation or testing

  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

  • scores_fake (list) – discriminator scores of generated waveforms

  • feats_fake (list) – discriminator features of generated waveforms

  • feats_real (list) – discriminator features of groundtruth waveforms

  • log_dur_pred (torch.Tensor) – Predicted duration for duration loss

  • log_dur (torch.Tensor) – Real duration for duration loss

Return type:

Dictionary of generator losses
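
Conceptually, the weighting step is a weighted sum over whichever loss terms are enabled. The sketch below shows that combination with made-up values; the key names, the numbers, and the helper name `combine_losses` are hypothetical and do not reflect the keys of the dictionary this class returns.

```python
def combine_losses(losses, weights):
    """Weighted sum of named loss terms; terms without a weight are skipped."""
    return sum(
        weights[name] * value
        for name, value in losses.items()
        if name in weights
    )

# Hypothetical values and key names, for illustration only.
losses = {"stft": 2.0, "mseg": 0.5, "feat_match": 1.0, "l1_spec": 0.25}
weights = {"mseg": 1.0, "feat_match": 10.0, "l1_spec": 4.0}
print(combine_losses(losses, weights))  # 11.5
```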

class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses

Parameters:

msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

Parameters:
  • scores_fake (list) – discriminator scores of generated waveforms

  • scores_real (list) – discriminator scores of groundtruth waveforms

Return type:

Dictionary of discriminator losses