speechbrain.lobes.models.HifiGAN module

Neural network modules for HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.

For more details: https://arxiv.org/pdf/2010.05646.pdf

  • Duret Jarod 2021

  • Yingzhi WANG 2022




DiscriminatorLoss – Creates a summary of discriminator losses.

DiscriminatorP – HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. For example, with period 2: waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.

DiscriminatorS – HiFiGAN Scale Discriminator.

GeneratorLoss – Creates a summary of generator losses and applies weights for different losses.

HifiganDiscriminator – HiFiGAN discriminator wrapping MPD and MSD.

HifiganGenerator – HiFiGAN Generator with Multi-Receptive Field Fusion (MRF).

L1SpecLoss – L1 loss over spectrograms as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps learning details compared with L2 loss.

MSEDLoss – Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.

MSEGLoss – Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its outputs are classified with a value close to 1.

MelganFeatureLoss – Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

MultiPeriodDiscriminator – HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the period discriminator at several periods.

MultiScaleDiscriminator – HiFiGAN Multi-Scale Discriminator.

MultiScaleSTFTLoss – Multi-scale STFT loss.

ResBlock1 – Residual Block Type 1, which has 3 convolutional layers in each convolution block.

ResBlock2 – Residual Block Type 2, which has 2 convolutional layers in each convolution block.

STFTLoss – STFT loss.


dynamic_range_compression – Dynamic range compression for audio signals.

mel_spectogram – Calculates a mel spectrogram for a raw audio signal.

stft – Computes the Fourier transform of short overlapping windows of the input.


speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals.
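
As a rough illustration, this kind of compression is commonly implemented as a clipped logarithm, log(max(x, clip_val) * C). The plain-Python sketch below assumes that formulation. It is not taken from the SpeechBrain source, the helper name is hypothetical, and it works on Python lists where the library function works on tensors.

```python
import math


def dynamic_range_compression_sketch(values, C=1, clip_val=1e-5):
    """Clipped log compression: log(max(x, clip_val) * C) per element.

    Assumption: this mirrors the usual log-compression recipe; it is an
    illustration, not the library implementation.
    """
    return [math.log(max(v, clip_val) * C) for v in values]


# Zero is clipped to clip_val before the log, so no -inf is produced.
compressed = dynamic_range_compression_sketch([0.0, 1.0, 10.0])
```

Clipping keeps silent frames finite, and C acts as a gain applied before the logarithm.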

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates a mel spectrogram for a raw audio signal.

  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band (area normalization).

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – Whether to apply dynamic range compression.

  • audio (torch.Tensor) – Input audio signal.
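
To make the roles of f_min, f_max, n_mels and mel_scale more concrete, the sketch below places mel filter edges using the widely used HTK mapping m = 2595 * log10(1 + f / 700). The helper names are hypothetical and the code is a plain-Python illustration under that assumption, not the routine this function calls internally.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of the HTK mapping above."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_mels):
    """Return n_mels + 2 edge frequencies (Hz), equally spaced in mel.

    Each triangular filter spans three consecutive edges, which is why
    n_mels filters need n_mels + 2 edges.
    """
    lo, hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz_htk(lo + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(f_min=0.0, f_max=8000.0, n_mels=80)
```

Because the edges are uniform in mel rather than in Hz, the filters are narrow at low frequencies and progressively wider toward f_max.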

class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (tuple) – dilation value for each conv layer in a block.


forward(x)[source]

Returns the output of ResBlock1.

x (torch.Tensor (batch, channel, time)) – input tensor.

remove_weight_norm()[source]

This function removes weight normalization during inference.

training: bool
class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (tuple) – dilation value for each conv layer in a block.


forward(x)[source]

Returns the output of ResBlock2.

x (torch.Tensor (batch, channel, time)) – input tensor.

remove_weight_norm()[source]

This function removes weight normalization during inference.

training: bool
class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

  • in_channels (int) – number of input tensor channels.

  • out_channels (int) – number of output tensor channels.

  • resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.

  • resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.

  • resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.

  • upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.

  • upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.

  • upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.

  • inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
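
The upsampling parameters fix how many waveform samples each input frame produces. The sketch below is a back-of-the-envelope calculation rather than library code. It assumes the output length is the number of input frames times the product of upsample_factors, which agrees with the shapes in the usage example for this class, and that each upsampling layer halves the channel count as described above. The function name is hypothetical.

```python
import math


def generator_layout(upsample_initial_channel, upsample_factors, n_frames):
    """Return (total upsampling factor, per-layer output channels, samples).

    Assumption: layer i maps C / 2**i channels to C / 2**(i + 1) channels,
    where C is upsample_initial_channel.
    """
    total_factor = math.prod(upsample_factors)
    channels = [
        upsample_initial_channel // (2 ** (i + 1))
        for i in range(len(upsample_factors))
    ]
    return total_factor, channels, n_frames * total_factor


# Same settings as the usage example: 33 frames, factors 8 * 8 * 2 * 2.
factor, channels, n_samples = generator_layout(512, [8, 8, 2, 2], n_frames=33)
```

With these settings the total factor is 256, so the product of upsample_factors should normally equal the hop length used to compute the input features.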


>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import HifiganGenerator
>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator = HifiganGenerator(
...     in_channels=80,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[16, 16, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])

forward(x, g=None)[source]
  • x (torch.Tensor (batch, channel, time)) – feature input tensor.

  • g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.


remove_weight_norm()[source]

This function removes weight normalization during inference.

inference(x)[source]

The inference function pads the input and runs the forward method.

x (torch.Tensor (batch, channel, time)) – feature input tensor.

training: bool
class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note:

if period is 2, waveform = [1, 2, 3, 4, 5, 6 …] --> [1, 3, 5 …] --> convs --> score, feat
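
One way to picture "every Pth value" is to fold the 1-D waveform into rows of length P, so that each column holds samples spaced P apart. The sketch below is a plain-Python illustration with a hypothetical helper name. Reflect-style padding to a multiple of the period is an assumption here; the module may pad differently.

```python
def fold_by_period(waveform, period):
    """Fold a 1-D list into rows of length `period`.

    Column j of the result contains samples j, j + period, j + 2 * period...
    Assumption: the tail is completed by mirroring the last samples.
    """
    remainder = len(waveform) % period
    if remainder:
        n_pad = period - remainder
        # Mirror the samples just before the last one (reflect padding).
        waveform = waveform + waveform[-2:-2 - n_pad:-1]
    return [waveform[i:i + period] for i in range(0, len(waveform), period)]


rows = fold_by_period([1, 2, 3, 4, 5, 6], period=2)
first_column = [row[0] for row in rows]  # the samples taken with stride 2
```

In this picture, 2-D convolutions whose kernel has width 1 along the period axis only ever mix samples that are P steps apart.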


forward(x)[source]

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies DiscriminatorP at several periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.
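
The reason for preferring prime periods can be checked with a small counting experiment. The sketch below is a simplified illustration with hypothetical helper names: it looks only at the sample positions read with a given stride starting at index 0, and measures how many of them two periods share.

```python
def sampled_positions(period, length):
    """Positions read with stride `period`, starting at index 0."""
    return set(range(0, length, period))


def overlap_fraction(p, q, length=120):
    """Share of the sparser view's positions that the other view also reads."""
    a, b = sampled_positions(p, length), sampled_positions(q, length)
    return len(a & b) / min(len(a), len(b))


non_prime_overlap = overlap_fraction(2, 4)  # every 4th sample is also a 2nd sample
prime_overlap = overlap_fraction(3, 5)      # only multiples of 15 coincide
```

With periods 2 and 4 the sparser view is fully contained in the other one, while with periods 3 and 5 only a third of its positions are shared, so each sub-discriminator sees a more distinct slice of the signal.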


forward(x)[source]

Returns Multi-Period Discriminator scores and features.


x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.


use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.


forward(x)[source]

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.


forward(x)[source]

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.


>>> import torch
>>> from speechbrain.lobes.models.HifiGAN import HifiganDiscriminator
>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator = HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8

forward(x)[source]

Returns the scores and the list of lists of features from each layer of each discriminator.


x (torch.Tensor) – input waveform.

training: bool
speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]

Computes the Fourier transform of short overlapping windows of the input (short-time Fourier transform).

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. Generated and real waveforms are converted to spectrograms and compared with L1 and spectral convergence losses. It comes from the Parallel WaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf

  • n_fft (int) – size of Fourier transform.

  • hop_length (int) – the distance between neighboring sliding window frames.

  • win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor
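
For orientation, the two terms can be written out on small magnitude spectrograms. The sketch below follows the definitions in the Parallel WaveGAN paper (Frobenius-norm ratio for spectral convergence, mean absolute difference of log magnitudes for the magnitude term). It uses nested Python lists and hypothetical function names, and it is not the library implementation, which may differ in details such as clamping.

```python
import math


def _pairs(mag_hat, mag):
    """Yield (generated, real) magnitude pairs element by element."""
    for row_hat, row in zip(mag_hat, mag):
        yield from zip(row_hat, row)


def spectral_convergence(mag_hat, mag):
    """|| mag - mag_hat ||_F / || mag ||_F."""
    num = math.sqrt(sum((b - a) ** 2 for a, b in _pairs(mag_hat, mag)))
    den = math.sqrt(sum(b ** 2 for row in mag for b in row))
    return num / den


def log_magnitude_l1(mag_hat, mag):
    """Mean absolute difference between log magnitudes."""
    diffs = [abs(math.log(b) - math.log(a)) for a, b in _pairs(mag_hat, mag)]
    return sum(diffs) / len(diffs)


real = [[1.0, 1.0]]
generated = [[2.0, 2.0]]
sc = spectral_convergence(generated, real)
mag = log_magnitude_l1(generated, real)
```

Spectral convergence is dominated by high-energy bins, while the log-magnitude term gives more weight to low-energy regions, which is why the two are used together.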

training: bool
class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. Generated and real waveforms are converted to spectrograms at several STFT resolutions and compared with L1 and spectral convergence losses. It comes from the Parallel WaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor

training: bool
class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 loss over spectrograms as described in the HiFiGAN paper (https://arxiv.org/pdf/2010.05646.pdf). Note: L1 loss helps learning details compared with L2 loss.

  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mel_channels (int) – Number of mel filterbanks.

  • mel_fmin (float) – Minimum frequency.

  • mel_fmax (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • mel_normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band (area normalization).

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • dynamic_range_compression (bool) – Whether to apply dynamic range compression.

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor
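
One common intuition for the note about L1 versus L2 is that squaring shrinks errors smaller than 1, so an L2 loss largely ignores small deviations that an L1 loss still penalizes in proportion. The sketch below illustrates that intuition on flat lists of spectrogram values. The function names are hypothetical and this is not the loss code itself.

```python
def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


target = [1.0, 1.0, 1.0, 1.0]
pred = [1.1, 0.9, 1.1, 0.9]  # small, detail-level deviations of 0.1

l1 = l1_loss(pred, target)
l2 = l2_loss(pred, target)
```

Here the L1 value is about ten times the L2 value, so under L1 the gradient signal for fine detail does not vanish as quickly when the prediction is already close.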

training: bool
class speechbrain.lobes.models.HifiGAN.MSEGLoss[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its outputs are classified with a value close to 1.

forward(score_fake)[source]

Returns Generator GAN loss.


score_fake (list) – discriminator scores of generated waveforms D(G(s)).
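
In least-squares GAN form, this objective is the mean of (D(G(s)) - 1)^2. The sketch below writes it out for a list holding one flat list of scores per sub-discriminator. Summing the per-discriminator terms is an assumption made for illustration, and the function name is hypothetical.

```python
def mse_generator_loss(scores_fake):
    """Sum over sub-discriminators of mean((score - 1) ** 2).

    scores_fake: list with one flat list of scores per sub-discriminator.
    """
    total = 0.0
    for scores in scores_fake:
        total += sum((s - 1.0) ** 2 for s in scores) / len(scores)
    return total


fooled = mse_generator_loss([[1.0, 1.0], [1.0]])  # discriminators output 1
caught = mse_generator_loss([[0.5], [0.0]])       # discriminators are not fooled
```

The loss reaches zero only when every sub-discriminator scores the generated audio as real.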

training: bool
class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns feature matching loss

  • fake_feats (list) – discriminator features of generated waveforms

  • real_feats (list) – discriminator features of groundtruth waveforms
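
A minimal sketch of feature matching, assuming the loss is the L1 distance between corresponding feature maps, averaged over all layers of all sub-discriminators. The exact normalization used by the class is not stated here, so the averaging below is an assumption, and feature maps are represented as flat Python lists for simplicity.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Average L1 distance between matching discriminator feature maps.

    fake_feats / real_feats: list (per sub-discriminator) of lists (per
    layer) of flat feature lists.
    """
    total = 0.0
    n_layers = 0
    for disc_fake, disc_real in zip(fake_feats, real_feats):
        for layer_fake, layer_real in zip(disc_fake, disc_real):
            diffs = [abs(f - r) for f, r in zip(layer_fake, layer_real)]
            total += sum(diffs) / len(diffs)
            n_layers += 1
    return total / n_layers


real_feats = [[[1.0, 1.0], [0.0, 2.0]]]  # one discriminator, two layers
fake_feats = [[[0.0, 2.0], [0.0, 2.0]]]
fm = feature_matching_loss(fake_feats, real_feats)
```

Because the comparison happens in the discriminator's feature space, the generator is pushed to match intermediate statistics of real audio, not just the final score.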

training: bool
class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

  • score_fake (list) – discriminator scores of generated waveforms

  • score_real (list) – discriminator scores of groundtruth waveforms
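
The matching discriminator objective in least-squares form is mean((D(x) - 1)^2) + mean(D(G(s))^2). The sketch below spells it out with one flat list of scores per sub-discriminator. The returned tuple layout and the function name are hypothetical, chosen only for illustration.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Return (total, real_term, fake_term) summed over sub-discriminators."""
    real_term = 0.0
    fake_term = 0.0
    for fake, real in zip(scores_fake, scores_real):
        # Real audio should be scored as 1, generated audio as 0.
        real_term += sum((r - 1.0) ** 2 for r in real) / len(real)
        fake_term += sum(f ** 2 for f in fake) / len(fake)
    return real_term + fake_term, real_term, fake_term


perfect = mse_discriminator_loss(scores_fake=[[0.0, 0.0]], scores_real=[[1.0, 1.0]])
confused = mse_discriminator_loss(scores_fake=[[1.0]], scores_real=[[0.0]])
```

Keeping the real and fake terms separate is useful for logging, since a collapse in only one of them usually points to an imbalance between generator and discriminator.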

training: bool
class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

  • stft_loss (object) – object of stft loss

  • stft_loss_weight (float) – weight of STFT loss

  • mseg_loss (object) – object of mseg loss

  • mseg_loss_weight (float) – weight of mseg loss

  • feat_match_loss (object) – object of feature match loss

  • feat_match_loss_weight (float) – weight of feature match loss

  • l1_spec_loss (object) – object of L1 spectrogram loss

  • l1_spec_loss_weight (float) – weight of L1 spectrogram loss

forward(y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None)[source]

Returns a dictionary of generator losses and applies weights

  • y_hat (torch.Tensor) – generated waveform tensor

  • y (torch.Tensor) – real waveform tensor

  • scores_fake (list) – discriminator scores of generated waveforms

  • feats_fake (list) – discriminator features of generated waveforms

  • feats_real (list) – discriminator features of groundtruth waveforms
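
The weighting step can be sketched as a weighted sum over named loss terms. The dictionary keys and the function name below are hypothetical and do not reflect the keys returned by this class. The example weights (2 for feature matching, 45 for the mel-spectrogram term) are the values reported in the HiFi-GAN paper and are used only as an illustration.

```python
def weighted_generator_loss(losses, weights):
    """Scale each loss term by its weight and add a 'total' entry.

    Terms without a weight default to 0, mirroring the zero default
    weights in the constructor signature.
    """
    weighted = {
        name: weights.get(name, 0.0) * value for name, value in losses.items()
    }
    weighted["total"] = sum(weighted.values())
    return weighted


example = weighted_generator_loss(
    losses={"adv": 0.5, "feat_match": 0.2, "l1_spec": 0.1},
    weights={"adv": 1.0, "feat_match": 2.0, "l1_spec": 45.0},
)
```

Returning the individual weighted terms alongside the total makes it easy to see which component dominates the generator update.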

training: bool
class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses


msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

  • scores_fake (list) – discriminator scores of generated waveforms

  • scores_real (list) – discriminator scores of groundtruth waveforms

training: bool