speechbrain.lobes.models.HifiGAN module

Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

For more details: https://arxiv.org/pdf/2010.05646.pdf

Authors

Duret Jarod 2021
Yingzhi WANG 2022

Summary

Classes:

`DiscriminatorLoss`	Creates a summary of discriminator losses
`DiscriminatorP`	HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.
`DiscriminatorS`	HiFiGAN Scale Discriminator.
`GeneratorLoss`	Creates a summary of generator losses and applies weights for different losses
`HifiganDiscriminator`	HiFiGAN discriminator wrapping MPD and MSD.
`HifiganGenerator`	HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)
`L1SpecLoss`	L1 loss over spectrograms, as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: L1 loss helps learning details compared with L2 loss.
`MSEDLoss`	Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.
`MSEGLoss`	Mean Squared Generator Loss. The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.
`MelganFeatureLoss`	Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
`MultiPeriodDiscriminator`	HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the period discriminator (DiscriminatorP) with different periods.
`MultiScaleDiscriminator`	HiFiGAN Multi-Scale Discriminator.
`MultiScaleSTFTLoss`	Multi-scale STFT loss.
`ResBlock1`	Residual Block Type 1, which has 3 convolutional layers in each convolution block.
`ResBlock2`	Residual Block Type 2, which has 2 convolutional layers in each convolution block.
`STFTLoss`	STFT loss.

Functions:

`dynamic_range_compression`	Dynamic range compression for audio signals
`mel_spectogram`	calculates the mel spectrogram of a raw audio signal
`stft`	computes the Fourier transform of short overlapping windows of the input

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]: Dynamic range compression for audio signals
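
A minimal sketch of what this kind of compression typically computes, assuming the log-of-clipped-magnitude form used in common HiFi-GAN implementations, log(max(x, clip_val) * C). The helper name is hypothetical and it works on plain Python floats rather than tensors; the library function may differ in detail.

```python
import math


def dynamic_range_compression_sketch(values, C=1, clip_val=1e-5):
    """Log-compress non-negative magnitudes, clipping small values first.

    Assumed form: log(max(x, clip_val) * C). Clipping avoids log(0).
    """
    return [math.log(max(v, clip_val) * C) for v in values]


compressed = dynamic_range_compression_sketch([0.0, 1e-5, 1.0, 10.0])
```

Values at or below clip_val all map to the same floor, and a magnitude of 1 maps to 0 when C=1.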

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates the mel spectrogram of a raw audio signal.

Parameters

sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
compression (bool) – whether to do dynamic range compression
audio (torch.tensor) – input audio signal
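
To make the roles of f_min, f_max, n_mels, and mel_scale more concrete, the sketch below places filter edges equally spaced on the "htk" mel scale, m = 2595 * log10(1 + f / 700). The helper names are hypothetical and the code is plain Python, not the library's implementation. The "slaney" scale uses a different formula that is not shown here.

```python
import math


def hz_to_mel_htk(freq_hz):
    """Convert a frequency in Hz to the HTK mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min, f_max, n_mels):
    """Edge frequencies (Hz) for n_mels triangular filters.

    n_mels overlapping triangles need n_mels + 2 points that are
    equally spaced on the mel axis between f_min and f_max.
    """
    mel_lo, mel_hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (mel_hi - mel_lo) / (n_mels + 1)
    return [mel_to_hz_htk(mel_lo + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(f_min=0.0, f_max=8000.0, n_mels=80)
```

The resulting bands are narrow at low frequencies and widen towards f_max, which is the purpose of the mel warping.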

class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

Parameters

channels (int) – number of hidden channels for the convolutional layers.
kernel_size (int) – size of the convolution filter in each layer.
dilation (list) – list of dilation values for each conv layer in a block.
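
Because a residual block adds its input back to the convolution output, each convolution has to preserve the time dimension. The sketch below shows the usual "same" padding arithmetic for a dilated 1-D convolution with an odd kernel size. The helper names are hypothetical, and how the library applies padding internally is an assumption here.

```python
def same_padding(kernel_size, dilation):
    """Padding per side that keeps the length unchanged (odd kernel, stride 1)."""
    return (kernel_size * dilation - dilation) // 2


def conv1d_out_len(length, kernel_size, dilation, padding, stride=1):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Defaults from the signature above: kernel_size=3, dilation=(1, 3, 5).
lengths = [
    conv1d_out_len(100, 3, d, same_padding(3, d)) for d in (1, 3, 5)
]
```

Each dilation widens the receptive field of the layer without changing the output length, so the residual addition stays well defined.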

forward(x)[source]

Returns the output of ResBlock1

Parameters: x (torch.Tensor (batch, channel, time)) – input tensor.

remove_weight_norm()[source]: This function removes weight normalization for inference.

training: bool

class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

Parameters

channels (int) – number of hidden channels for the convolutional layers.
kernel_size (int) – size of the convolution filter in each layer.
dilation (list) – list of dilation values for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock2

Parameters: x (torch.Tensor (batch, channel, time)) – input tensor.

remove_weight_norm()[source]: This function removes weight normalization for inference.

training: bool

class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters

in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.

Example

>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator= HifiganGenerator(
...    in_channels = 80,
...    out_channels = 1,
...    resblock_type = "1",
...    resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...    resblock_kernel_sizes = [3, 7, 11],
...    upsample_kernel_sizes = [16, 16, 4, 4],
...    upsample_initial_channel = 512,
...    upsample_factors = [8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
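
The output shape in the example follows from the upsampling factors: each input frame is expanded by their product. The sketch below reproduces that arithmetic in plain Python. The per-layer channel counts assume that the halving described for upsample_initial_channel applies to the output of each upsampling layer, which is an assumption about the implementation.

```python
import math

upsample_factors = [8, 8, 2, 2]
upsample_initial_channel = 512
n_frames = 33  # time dimension of the example input

# Each frame becomes prod(upsample_factors) waveform samples.
total_upsampling = math.prod(upsample_factors)
n_samples = n_frames * total_upsampling

# Assumed output channels of each upsampling layer (halved every layer).
channels = [
    upsample_initial_channel // 2 ** (i + 1)
    for i in range(len(upsample_factors))
]
```

Here total_upsampling is 256 and n_samples is 8448, matching the last dimension of the output shape shown in the example.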

forward(x, g=None)[source]

Parameters

x (torch.Tensor (batch, channel, time)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

remove_weight_norm()[source]: This function removes weight normalization for inference.

inference(c)[source]

The inference function performs a padding and runs the forward method.

Parameters: c (torch.Tensor (batch, channel, time)) – feature input tensor.

training: bool

class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note:

if period is 2, waveform = [1, 2, 3, 4, 5, 6 …] –> [1, 3, 5 …] –> convs –> score, feat

Parameters: x (torch.Tensor (batch, 1, time)) – input waveform.
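
The "every Pth value" view can be sketched by folding the 1-D waveform into rows of length P, so that each column holds samples spaced P apart. The helper below is hypothetical and uses plain lists. It pads with zeros purely for simplicity; the padding mode the module actually uses is not stated on this page.

```python
def fold_by_period(waveform, period, pad_value=0):
    """Reshape a 1-D sequence into rows of `period` samples.

    The sequence is right-padded so its length is a multiple of `period`.
    Column k of the result then contains samples k, k + P, k + 2P, ...
    """
    n_pad = (-len(waveform)) % period
    padded = list(waveform) + [pad_value] * n_pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]


rows = fold_by_period([1, 2, 3, 4, 5, 6, 7], period=2)
first_column = [row[0] for row in rows]
```

With period 2, the first column is [1, 3, 5, 7], which is the strided view described in the note above. A 2-D convolution over the folded signal can then process each column independently along the time axis.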

forward(x)[source]

Parameters: x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool

class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the period discriminator (DiscriminatorP) with different periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.
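
The benefit of prime periods can be seen by looking at which sample positions two strided views share. In the plain-Python sketch below (helper name hypothetical), periods 2 and 4 overlap completely on the coarser view, while the coprime periods 2 and 3 only share every sixth sample.

```python
def strided_indices(period, length):
    """Sample positions visited when taking every `period`-th sample from 0."""
    return set(range(0, length, period))


length = 60

# Non-coprime periods: everything seen with period 4 is also seen with period 2.
overlap_2_4 = strided_indices(2, length) & strided_indices(4, length)

# Coprime periods: only multiples of 6 are shared.
overlap_2_3 = strided_indices(2, length) & strided_indices(3, length)
```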

forward(x)[source]

Returns Multi-Period Discriminator scores and features

Parameters: x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool

class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.

Parameters: use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.

forward(x)[source]

Parameters: x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool

class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.

forward(x)[source]

Parameters: x (torch.Tensor (batch, 1, time)) – input waveform.

training: bool

class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator= HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8

forward(x)[source]

Returns list of list of features from each layer of each discriminator.

Parameters: x (torch.Tensor) – input waveform.

training: bool

speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]: computes the Fourier transform of short overlapping windows of the input

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. The generated and real input waveforms are converted to spectrograms, which are compared using L1 and spectral convergence losses. It is taken from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

Parameters

n_fft (int) – size of Fourier transform.
hop_length (int) – the distance between neighboring sliding window frames.
win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

Parameters

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor
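
A sketch of the two terms as defined in the Parallel WaveGAN paper: spectral convergence is the Frobenius norm of the magnitude difference divided by the Frobenius norm of the real magnitude, and the magnitude loss is the mean L1 distance between log magnitudes. The functions below are hypothetical helpers that operate on flattened lists of STFT magnitudes in plain Python; they are not the module's code.

```python
import math


def spectral_convergence(mag_hat, mag):
    """|| mag - mag_hat ||_F / || mag ||_F over flattened magnitudes."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(mag, mag_hat)))
    den = math.sqrt(sum(a ** 2 for a in mag))
    return num / den


def log_magnitude_l1(mag_hat, mag):
    """Mean absolute difference between log magnitudes (inputs must be > 0)."""
    diffs = [abs(math.log(a) - math.log(b)) for a, b in zip(mag, mag_hat)]
    return sum(diffs) / len(diffs)


sc = spectral_convergence(mag_hat=[0.0, 0.0], mag=[3.0, 4.0])
lm = log_magnitude_l1(mag_hat=[1.0, 1.0], mag=[1.0, math.e])
```

Spectral convergence emphasizes large spectral peaks, while the log-magnitude term gives more weight to low-energy components, which is why the two are used together.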

training: bool

class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. The generated and real input waveforms are converted to spectrograms at several resolutions, which are compared using L1 and spectral convergence losses. It is taken from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

Parameters

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor
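
The default arguments define three STFT resolutions, one per position in the tuples. The sketch below pairs them up and averages a per-resolution loss. Averaging (rather than summing) across resolutions is an assumption, and per_resolution_loss is a hypothetical stand-in for the single-resolution loss.

```python
n_ffts = (1024, 2048, 512)
hop_lengths = (120, 240, 50)
win_lengths = (600, 1200, 240)

# One (n_fft, hop_length, win_length) triple per resolution.
resolutions = list(zip(n_ffts, hop_lengths, win_lengths))

# Number of frequency bins in a one-sided STFT at each resolution.
freq_bins = [n_fft // 2 + 1 for n_fft, _, _ in resolutions]


def average_over_resolutions(per_resolution_loss, resolutions):
    """Average a loss computed independently at each resolution.

    `per_resolution_loss` is a hypothetical callable taking
    (n_fft, hop_length, win_length) and returning a float.
    """
    losses = [per_resolution_loss(*res) for res in resolutions]
    return sum(losses) / len(losses)


# Stand-in loss, for illustration only.
mean_loss = average_over_resolutions(
    lambda n_fft, hop, win: hop / 10.0, resolutions
)
```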

training: bool

class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 loss over spectrograms, as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: L1 loss helps learning details compared with L2 loss.

Parameters

sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mel_channels (int) – Number of mel filterbanks.
mel_fmin (float) – Minimum frequency.
mel_fmax (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
mel_normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
dynamic_range_compression (bool) – whether to do dynamic range compression

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

Parameters

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor

training: bool

class speechbrain.lobes.models.HifiGAN.MSEGLoss[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.

forward(score_fake)[source]

Returns Generator GAN loss

Parameters: score_fake (list) – discriminator scores of generated waveforms D(G(s))
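
A minimal sketch of a least-squares generator objective, mean((1 - D(G(s)))^2), applied to each sub-discriminator's scores. The helper is hypothetical and uses nested Python lists; summing the per-discriminator terms is an assumption about how the list is reduced.

```python
def mse_generator_loss(scores_fake):
    """scores_fake: one list of scores per sub-discriminator.

    Each term is the mean squared distance of the scores from 1.
    """
    total = 0.0
    for scores in scores_fake:
        total += sum((1.0 - s) ** 2 for s in scores) / len(scores)
    return total


loss_fooled = mse_generator_loss([[1.0, 1.0], [1.0]])  # discriminators say "real"
loss_caught = mse_generator_loss([[0.0, 0.0], [0.0]])  # discriminators say "fake"
```

The loss is zero when every score equals 1 and grows as the discriminators become more confident that the samples are generated.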

training: bool

class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns feature matching loss

Parameters

fake_feats (list) – discriminator features of generated waveforms
real_feats (list) – discriminator features of groundtruth waveforms
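
Feature matching compares intermediate discriminator activations rather than final scores. The sketch below computes a mean L1 distance per layer and averages over all layers of all sub-discriminators. The helper is hypothetical, features are flattened to plain lists, and the exact normalization used by the module is an assumption.

```python
def feature_matching_loss(fake_feats, real_feats):
    """fake_feats / real_feats: [discriminator][layer][flattened activations]."""
    layer_losses = []
    for disc_fake, disc_real in zip(fake_feats, real_feats):
        for layer_fake, layer_real in zip(disc_fake, disc_real):
            l1 = sum(abs(f - r) for f, r in zip(layer_fake, layer_real))
            layer_losses.append(l1 / len(layer_fake))
    return sum(layer_losses) / len(layer_losses)


real = [[[1.0, 2.0], [3.0]], [[0.0, 0.0]]]
fake = [[[1.0, 4.0], [3.0]], [[1.0, 1.0]]]
loss = feature_matching_loss(fake, real)
```

In this toy input the three layer-wise distances are 1.0, 0.0 and 1.0, so the loss is 2/3.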

training: bool

class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters

score_fake (list) – discriminator scores of generated waveforms
score_real (list) – discriminator scores of groundtruth waveforms
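
A minimal sketch of the matching least-squares discriminator objective, mean((1 - D(x))^2) + mean(D(G(s))^2), per sub-discriminator. The helper is hypothetical and uses nested Python lists; summing across sub-discriminators is an assumption.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Push real scores towards 1 and generated scores towards 0."""
    total = 0.0
    for fake, real in zip(scores_fake, scores_real):
        real_term = sum((1.0 - s) ** 2 for s in real) / len(real)
        fake_term = sum(s ** 2 for s in fake) / len(fake)
        total += real_term + fake_term
    return total


perfect = mse_discriminator_loss([[0.0, 0.0]], [[1.0, 1.0]])
undecided = mse_discriminator_loss([[0.5, 0.5]], [[0.5, 0.5]])
```

A discriminator that separates real from generated samples perfectly reaches zero, while one that outputs 0.5 everywhere pays 0.25 for each of the two terms.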

training: bool

class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

Parameters

stft_loss (object) – object of stft loss
stft_loss_weight (float) – weight of STFT loss
mseg_loss (object) – object of mseg loss
mseg_loss_weight (float) – weight of mseg loss
feat_match_loss (object) – object of feature match loss
feat_match_loss_weight (float) – weight of feature match loss
l1_spec_loss (object) – object of L1 spectrogram loss
l1_spec_loss_weight (float) – weight of L1 spectrogram loss

forward(y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None)[source]

Returns a dictionary of generator losses and applies weights

Parameters

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor
scores_fake (list) – discriminator scores of generated waveforms
feats_fake (list) – discriminator features of generated waveforms
feats_real (list) – discriminator features of groundtruth waveforms
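
The weighting step can be sketched as a dictionary of weighted terms plus their sum. The function and key names below are hypothetical, and the weights (2 for feature matching, 45 for the mel L1 term) are the values reported in the HiFi-GAN paper, not necessarily the defaults of any recipe in this library.

```python
def combine_generator_losses(losses, weights):
    """Scale each loss by its weight and add a 'total' entry.

    losses, weights: dicts keyed by loss name (hypothetical keys).
    """
    weighted = {name: weights[name] * value for name, value in losses.items()}
    weighted["total"] = sum(weighted.values())
    return weighted


raw = {"adversarial": 1.5, "feature_matching": 0.5, "mel_l1": 0.2}
weights = {"adversarial": 1.0, "feature_matching": 2.0, "mel_l1": 45.0}
out = combine_generator_losses(raw, weights)
```

Note that all weights in the constructor signature default to 0, so a loss term only contributes once both its object and a non-zero weight are supplied.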

training: bool

class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses

Parameters: msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

Parameters

scores_fake (list) – discriminator scores of generated waveforms
scores_real (list) – discriminator scores of groundtruth waveforms

training: bool