speechbrain.lobes.models.HifiGAN module

Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

For more details: https://arxiv.org/pdf/2010.05646.pdf

Authors
  • Duret Jarod 2021

  • Yingzhi WANG 2022

Summary

Classes:

DiscriminatorLoss

Creates a summary of discriminator losses

DiscriminatorP

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.

DiscriminatorS

HiFiGAN Scale Discriminator.

GeneratorLoss

Creates a summary of generator losses and applies weights for different losses

HifiganDiscriminator

HiFiGAN discriminator wrapping MPD and MSD.

HifiganGenerator

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

L1SpecLoss

L1 Loss over Spectrograms as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf Note: L1 loss helps learn details better than L2 loss.

MSEDLoss

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.

MSEGLoss

Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its outputs are classified with a value close to 1.

MelganFeatureLoss

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

MultiPeriodDiscriminator

HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the PeriodDiscriminator to apply it at different periods.

MultiScaleDiscriminator

HiFiGAN Multi-Scale Discriminator.

MultiScaleSTFTLoss

Multi-scale STFT loss.

ResBlock1

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

ResBlock2

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

STFTLoss

STFT loss.

Functions:

dynamic_range_compression

Dynamic range compression for audio signals

mel_spectogram

Calculates the mel spectrogram of a raw audio signal

stft

computes the Fourier transform of short overlapping windows of the input

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]

Dynamic range compression for audio signals
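A minimal sketch of what this kind of compression typically does, assuming the function follows the reference HiFi-GAN code (clamp to `clip_val`, scale by `C`, take the natural logarithm). The helper name `compress` is hypothetical and works on plain floats rather than tensors.

```python
import math

def compress(values, C=1, clip_val=1e-5):
    """Log-compress magnitudes; values below clip_val are clamped first.

    Assumption: mirrors log(clamp(x, min=clip_val) * C), applied
    element-wise to a list of floats instead of a tensor.
    """
    return [math.log(max(v, clip_val) * C) for v in values]

compressed = compress([0.0, 1.0, 10.0])
```

The clamp keeps the logarithm finite for silent frames, which is why `clip_val` is a small positive number.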

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates the mel spectrogram of a raw audio signal

Parameters
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • compression (bool) – whether to do dynamic range compression

  • audio (torch.tensor) – input audio signal

class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

Parameters
  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (list) – list of dilation values, one for each conv layer in a block.
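Because a residual block adds its input back to its output, every dilated convolution in it has to preserve the time dimension. The arithmetic can be sketched as follows, assuming stride-1 convolutions with odd kernel sizes as in the reference HiFi-GAN code. `same_padding` and `conv_out_length` are illustrative helpers, not part of this module.

```python
def same_padding(kernel_size, dilation):
    """Padding that keeps a stride-1 dilated convolution length-preserving."""
    return (kernel_size * dilation - dilation) // 2

def conv_out_length(length, kernel_size, dilation, padding):
    """Output length of a stride-1 1-D convolution."""
    return length + 2 * padding - dilation * (kernel_size - 1)

# Default ResBlock1 settings: kernel_size=3, dilation=(1, 3, 5).
paddings = [same_padding(3, d) for d in (1, 3, 5)]
lengths = [conv_out_length(100, 3, d, p) for d, p in zip((1, 3, 5), paddings)]
```

With these settings the paddings are 1, 3 and 5, and a 100-step input stays 100 steps long after each layer.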

forward(x)[source]

Returns the output of ResBlock1

Parameters

x (torch.Tensor (batch, channel, time)) – input tensor.

remove_weight_norm()[source]

This function removes weight normalization during inference.

training: bool
class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

Parameters
  • channels (int) – number of hidden channels for the convolutional layers.

  • kernel_size (int) – size of the convolution filter in each layer.

  • dilation (list) – list of dilation values, one for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock2

Parameters

x (torch.Tensor (batch, channel, time)) – input tensor.

remove_weight_norm()[source]

This function removes weight normalization during inference.

training: bool
class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters
  • in_channels (int) – number of input tensor channels.

  • out_channels (int) – number of output tensor channels.

  • resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.

  • resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.

  • resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.

  • upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.

  • upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.

  • upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.

  • inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.

Example

>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator= HifiganGenerator(
...    in_channels = 80,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [16, 16, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
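The output length in the example follows from the upsampling factors: each input frame is expanded by their product. A short check of that arithmetic is sketched here. The channel computation assumes that each upsampling layer halves the width it receives, which is how the `upsample_initial_channel` description reads.

```python
import math

upsample_factors = [8, 8, 2, 2]
total_upsampling = math.prod(upsample_factors)   # 256 samples per input frame

n_frames = 33
n_samples = n_frames * total_upsampling          # 8448, as in the example

# Assumed output width of each upsampling layer, starting from 512.
upsample_initial_channel = 512
channels = [upsample_initial_channel // 2 ** (i + 1)
            for i in range(len(upsample_factors))]
```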
forward(x, g=None)[source]
Parameters
  • x (torch.Tensor (batch, channel, time)) – feature input tensor.

  • g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

remove_weight_norm()[source]

This function removes weight normalization during inference.

inference(c)[source]

The inference function performs a padding and runs the forward method.

Parameters

c (torch.Tensor (batch, channel, time)) – feature input tensor.

training: bool
class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions.

Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 …] –> [1, 3, 5 …] –> convs –> score, feat
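One way to picture the "every Pth value" step is as a reshape of the 1-D waveform into rows of length `period`, so that each column holds samples spaced `period` apart. The sketch below assumes reflect padding up to a multiple of the period, as in the reference HiFi-GAN code, and assumes the waveform is longer than the period. `fold_by_period` is a hypothetical helper on plain lists.

```python
def fold_by_period(waveform, period):
    """Reflect-pad to a multiple of `period`, then split into rows of that length."""
    pad = (period - len(waveform) % period) % period
    # Reflect padding mirrors the samples just before the last one.
    padded = waveform + waveform[-2:-2 - pad:-1]
    return [padded[i:i + period] for i in range(0, len(padded), period)]

rows = fold_by_period([1, 2, 3, 4, 5, 6], period=2)
every_second = [row[0] for row in rows]          # [1, 3, 5]

padded_rows = fold_by_period([1, 2, 3, 4, 5], period=2)
```

Convolving along the row axis of such a layout lets the discriminator look at periodic structure directly.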

Parameters

x (torch.Tensor (batch, 1, time)) – input waveform.

forward(x)[source]
Parameters

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the PeriodDiscriminator to apply it at different periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.

forward(x)[source]

Returns Multi-Period Discriminator scores and features

Parameters

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.

Parameters

use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.

forward(x)[source]
Parameters

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.

forward(x)[source]
Parameters

x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool
class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator= HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8
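The value 8 in the example is consistent with five period sub-discriminators plus three scale sub-discriminators. The periods and the number of scales below are taken from the HiFi-GAN paper and are an assumption about this implementation's defaults.

```python
# Assumed configuration, following the HiFi-GAN paper.
mpd_periods = [2, 3, 5, 7, 11]   # one DiscriminatorP per period
msd_scales = 3                   # one DiscriminatorS per scale

n_sub_discriminators = len(mpd_periods) + msd_scales
```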
forward(x)[source]

Returns a list of lists of features from each layer of each discriminator.

Parameters

x (torch.Tensor) – input waveform.

training: bool
speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]

computes the Fourier transform of short overlapping windows of the input
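To make the operation concrete, here is a deliberately naive pure-Python magnitude STFT. It is a sketch under simplifying assumptions (periodic Hann window, `win_length` equal to `n_fft`, no padding or centering) and is not how this module computes it. `naive_stft_magnitude` is a hypothetical name.

```python
import cmath
import math

def naive_stft_magnitude(signal, n_fft, hop_length):
    """Magnitude STFT with a periodic Hann window and no padding."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        # Keep only the n_fft // 2 + 1 non-redundant bins of a real signal.
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

# A cosine sitting exactly on bin 2 of an 8-point transform.
signal = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
magnitudes = naive_stft_magnitude(signal, n_fft=8, hop_length=4)
```

Each frame has `n_fft // 2 + 1` bins, and the energy of this test signal peaks at bin 2 in every frame.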

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. Generated and real input waveforms are converted to spectrograms and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

Parameters
  • n_fft (int) – size of Fourier transform.

  • hop_length (int) – the distance between neighboring sliding window frames.

  • win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

Parameters
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor
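The two returned terms can be sketched from their definitions in the ParallelWaveGAN paper: an L1 distance between log-magnitudes, and the Frobenius norm of the magnitude difference relative to the norm of the reference. The helper below is hypothetical and takes precomputed, strictly positive magnitude spectrograms as nested lists. How the module reduces over the batch is not shown.

```python
import math

def stft_losses(mag_hat, mag):
    """Return (log-magnitude L1 loss, spectral convergence loss)."""
    flat_hat = [v for frame in mag_hat for v in frame]
    flat = [v for frame in mag for v in frame]
    # L1 distance between log-magnitudes, averaged over all bins.
    loss_mag = sum(abs(math.log(a) - math.log(b))
                   for a, b in zip(flat, flat_hat)) / len(flat)
    # Frobenius norm of the difference, relative to the reference norm.
    diff_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(flat, flat_hat)))
    loss_sc = diff_norm / math.sqrt(sum(a ** 2 for a in flat))
    return loss_mag, loss_sc

real = [[1.0, 2.0], [3.0, 4.0]]
same_mag, same_sc = stft_losses(real, real)
scaled_mag, scaled_sc = stft_losses([[2.0, 4.0], [6.0, 8.0]], real)
```

Identical spectrograms give zero for both terms, while doubling every magnitude gives a log-magnitude loss of ln 2 and a spectral convergence loss of 1.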

training: bool
class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. Generated and real input waveforms are converted to spectrograms at several resolutions and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

Parameters
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

training: bool
class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 Loss over Spectrograms as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf Note: L1 loss helps learn details better than L2 loss.

Parameters
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mel_channels (int) – Number of mel filterbanks.

  • mel_fmin (float) – Minimum frequency.

  • mel_fmax (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • mel_normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • dynamic_range_compression (bool) – whether to do dynamic range compression

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

Parameters
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

training: bool
class speechbrain.lobes.models.HifiGAN.MSEGLoss[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its outputs are classified with a value close to 1.

forward(score_fake)[source]

Returns Generator GAN loss

Parameters

score_fake (list) – discriminator scores of generated waveforms D(G(s))
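The objective described above is the least-squares GAN generator loss: the squared distance between the discriminator's scores for generated audio and the target value 1. A minimal sketch on plain floats follows. `mse_generator_loss` is a hypothetical helper, and the real module operates on tensors.

```python
def mse_generator_loss(scores_fake):
    """Mean squared distance between D(G(s)) scores and the target value 1."""
    return sum((1.0 - s) ** 2 for s in scores_fake) / len(scores_fake)

fooled = mse_generator_loss([1.0, 1.0])   # discriminator fully fooled
caught = mse_generator_loss([0.0, 0.0])   # discriminator not fooled at all
```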

training: bool
class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns feature matching loss

Parameters
  • fake_feats (list) – discriminator features of generated waveforms

  • real_feats (list) – discriminator features of groundtruth waveforms
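A sketch of the idea, assuming an L1 distance per layer averaged over every layer of every sub-discriminator. The exact normalization used by the module is an assumption here. Features are flattened to short lists of floats for readability, and `feature_matching_loss` is a hypothetical helper.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Mean absolute difference per layer, averaged over all layers."""
    total, n_layers = 0.0, 0
    for fake_layers, real_layers in zip(fake_feats, real_feats):
        for fake, real in zip(fake_layers, real_layers):
            total += sum(abs(f - r) for f, r in zip(fake, real)) / len(real)
            n_layers += 1
    return total / n_layers

# Two discriminators with two layers each.
real_feats = [[[0.0, 1.0], [2.0, 2.0]], [[1.0, 1.0], [0.0, 0.0]]]
fake_feats = [[[0.0, 1.0], [2.0, 4.0]], [[1.0, 1.0], [0.0, 0.0]]]
loss = feature_matching_loss(fake_feats, real_feats)
```

Only one of the four layers differs here, by a mean of 1.0, so the averaged loss is 0.25.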

training: bool
class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters
  • score_fake (list) – discriminator scores of generated waveforms

  • score_real (list) – discriminator scores of groundtruth waveforms
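The two targets described above translate into a real term and a fake term that are summed. The sketch below works on plain floats. The helper name and the choice to return the two parts alongside the total are assumptions for illustration.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Return (total, real_part, fake_part) of the least-squares loss."""
    loss_real = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    loss_fake = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake, loss_real, loss_fake

perfect = mse_discriminator_loss(scores_fake=[0.0, 0.0], scores_real=[1.0, 1.0])
confused = mse_discriminator_loss(scores_fake=[0.5, 0.5], scores_real=[0.5, 0.5])
```

A discriminator that cannot tell the two apart and outputs 0.5 everywhere pays 0.25 on each term.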

training: bool
class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

Parameters
  • stft_loss (object) – object of stft loss

  • stft_loss_weight (float) – weight of STFT loss

  • mseg_loss (object) – object of mseg loss

  • mseg_loss_weight (float) – weight of mseg loss

  • feat_match_loss (object) – object of feature match loss

  • feat_match_loss_weight (float) – weight of feature match loss

  • l1_spec_loss (object) – object of L1 spectrogram loss

  • l1_spec_loss_weight (float) – weight of L1 spectrogram loss

forward(y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None)[source]

Returns a dictionary of generator losses and applies weights

Parameters
  • y_hat (torch.tensor) – generated waveform tensor

  • y (torch.tensor) – real waveform tensor

  • scores_fake (list) – discriminator scores of generated waveforms

  • feats_fake (list) – discriminator features of generated waveforms

  • feats_real (list) – discriminator features of groundtruth waveforms
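The weighting step can be pictured as a weighted sum over named loss terms. The sketch below uses hypothetical dictionary keys, loss values and weights chosen only for illustration. The keys of the dictionary actually returned by the module are not documented here.

```python
def combine_generator_losses(losses, weights):
    """Scale each named loss by its weight and add their sum under 'total'."""
    weighted = {name: weights.get(name, 0.0) * value
                for name, value in losses.items()}
    weighted["total"] = sum(weighted.values())
    return weighted

# Hypothetical values; a weight of 0 disables a term, matching the defaults.
losses = {"stft": 2.0, "mseg": 0.5, "feat_match": 0.25, "l1_spec": 1.0}
weights = {"stft": 0.0, "mseg": 1.0, "feat_match": 10.0, "l1_spec": 45.0}
result = combine_generator_losses(losses, weights)
```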

training: bool
class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses

Parameters

msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

Parameters
  • scores_fake (list) – discriminator scores of generated waveforms

  • scores_real (list) – discriminator scores of groundtruth waveforms

training: bool