speechbrain.lobes.models.HifiGAN module
Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
For more details: https://arxiv.org/pdf/2010.05646.pdf
- Authors
Duret Jarod 2021
Yingzhi WANG 2022
Summary
Classes:
DiscriminatorLoss – Creates a summary of discriminator losses.
DiscriminatorP – HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] -> [1, 3, 5 ...] -> convs -> score, feat.
DiscriminatorS – HiFiGAN Scale Discriminator.
GeneratorLoss – Creates a summary of generator losses and applies weights for different losses.
HifiganDiscriminator – HiFiGAN discriminator wrapping MPD and MSD.
HifiganGenerator – HiFiGAN Generator with Multi-Receptive Field Fusion (MRF).
L1SpecLoss – L1 loss over spectrograms, as described in the HiFiGAN paper: https://arxiv.org/pdf/2010.05646.pdf Note: L1 loss helps with learning details compared with L2 loss.
MSEDLoss – Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.
MSEGLoss – Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its outputs are classified with a value close to 1.
MelganFeatureLoss – Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
MultiPeriodDiscriminator – HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the period discriminator with several different periods.
MultiScaleDiscriminator – HiFiGAN Multi-Scale Discriminator.
MultiScaleSTFTLoss – Multi-scale STFT loss.
ResBlock1 – Residual Block Type 1, which has 3 convolutional layers in each convolution block.
ResBlock2 – Residual Block Type 2, which has 2 convolutional layers in each convolution block.
STFTLoss – STFT loss.
Functions:
dynamic_range_compression – Dynamic range compression for audio signals.
mel_spectogram – Calculates the mel spectrogram of a raw audio signal.
stft – Computes the Fourier transform of short overlapping windows of the input.
Reference
- speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]
Dynamic range compression for audio signals
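As a rough illustration of what log-based dynamic range compression usually looks like in HiFi-GAN-style pipelines, the sketch below uses plain Python lists in place of tensors. The formula `log(max(x, clip_val) * C)` is an assumption inferred from the signature above (`C` as a multiplier, `clip_val` as a floor that avoids `log(0)`); the function name is hypothetical and the library source remains the authority on the exact behaviour.

```python
import math

def dynamic_range_compression_sketch(values, C=1, clip_val=1e-5):
    """Log-compress magnitudes. Assumed form: log(max(x, clip_val) * C)."""
    # Clamp to clip_val first so that zeros do not produce log(0).
    return [math.log(max(v, clip_val) * C) for v in values]

# Values at or below clip_val all map to the same floor, log(clip_val * C).
compressed = dynamic_range_compression_sketch([0.0, 1e-5, 1.0, 100.0])
```

The floor matters because silent frames would otherwise produce negative infinity in the log domain.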
- speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]
Calculates the mel spectrogram of a raw audio signal
- Parameters
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: “htk” or “slaney”.
compression (bool) – Whether to apply dynamic range compression.
audio (torch.Tensor) – Input audio signal.
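To make the roles of `f_min`, `f_max`, `n_mels` and `mel_scale` more concrete, here is a small sketch of how mel band edges can be placed using the standard HTK formula, mel = 2595 * log10(1 + f / 700). It covers only the “htk” scale; the “slaney” scale uses a different mapping. The helper names are hypothetical and are not part of this module.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_mels):
    """Return n_mels + 2 edge frequencies in Hz, equally spaced on the mel axis."""
    lo, hi = hz_to_mel_htk(f_min), hz_to_mel_htk(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz_htk(lo + i * step) for i in range(n_mels + 2)]

# 80 triangular filters between 0 Hz and 8 kHz need 82 edge frequencies.
edges = mel_band_edges(f_min=0.0, f_max=8000.0, n_mels=80)
```

Because the spacing is uniform in mel rather than in Hz, the bands are narrow at low frequencies and widen towards `f_max`.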
- class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]
Bases: Module
Residual Block Type 1, which has 3 convolutional layers in each convolution block.
- forward(x)[source]
Returns the output of ResBlock1
- Parameters
x (torch.Tensor (batch, channel, time)) – input tensor.
- class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]
Bases: Module
Residual Block Type 2, which has 2 convolutional layers in each convolution block.
- forward(x)[source]
Returns the output of ResBlock2
- Parameters
x (torch.Tensor (batch, channel, time)) – input tensor.
- class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]
Bases: Module
HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)
- Parameters
in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
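The shape arithmetic implied by `upsample_factors` and `upsample_initial_channel` can be sketched as follows. It assumes that each transposed convolution multiplies the time axis by exactly its stride and that the channel count quoted for the first layer is halved at each subsequent layer, as described above. The helper names are hypothetical.

```python
from math import prod

def generator_output_length(n_frames, upsample_factors):
    """Waveform samples produced from n_frames input frames."""
    return n_frames * prod(upsample_factors)

def upsample_channel_widths(upsample_initial_channel, n_layers):
    """Channel count at each upsampling layer, halving from one layer to the next."""
    return [upsample_initial_channel // (2 ** i) for i in range(n_layers)]

# 33 mel frames with factors [8, 8, 2, 2] give 33 * 256 samples.
samples = generator_output_length(33, [8, 8, 2, 2])
widths = upsample_channel_widths(512, 4)
```

The product of the upsampling factors (256 here) should match the hop length used to compute the input mel spectrogram, so that one frame maps to one hop of audio.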
Example
>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator = HifiganGenerator(
...     in_channels = 80,
...     out_channels = 1,
...     resblock_type = "1",
...     resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes = [3, 7, 11],
...     upsample_kernel_sizes = [16, 16, 4, 4],
...     upsample_initial_channel = 512,
...     upsample_factors = [8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
- forward(x, g=None)[source]
- Parameters
x (torch.Tensor (batch, channel, time)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.
- inference(c)[source]
The inference function pads the input and then runs the forward method.
- Parameters
c (torch.Tensor (batch, channel, time)) – feature input tensor.
- class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]
Bases: Module
HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note:
if period is 2, waveform = [1, 2, 3, 4, 5, 6 …] -> [1, 3, 5 …] -> convs -> score, feat
- Parameters
x (torch.Tensor (batch, 1, time)) – input waveform.
- forward(x)[source]
- Parameters
x (torch.Tensor (batch, 1, time)) – input waveform.
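The period-2 note above can be made concrete with a small sketch. It folds a 1-D signal into rows of `period` samples, so that each column holds every Pth value; in the HiFi-GAN paper, 2-D convolutions with a kernel width of 1 along the period axis then process each column independently. Reflect-padding the tail to a multiple of the period is an assumption about the implementation, and the helper names are hypothetical.

```python
def fold_by_period(waveform, period):
    """Reshape a 1-D signal into rows of `period` samples, reflect-padding the tail."""
    samples = list(waveform)
    remainder = len(samples) % period
    if remainder:
        n_pad = period - remainder
        # Reflect padding mirrors the samples that precede the last one.
        samples += samples[-2:-2 - n_pad:-1]
    return [samples[i:i + period] for i in range(0, len(samples), period)]

def column(rows, phase):
    """Every Pth value of the signal, starting at offset `phase`."""
    return [row[phase] for row in rows]

rows = fold_by_period([1, 2, 3, 4, 5, 6, 7], period=2)
first_phase = column(rows, 0)
second_phase = column(rows, 1)
```

Here `first_phase` is `[1, 3, 5, 7]`, matching the note, and `second_phase` holds the interleaved samples plus the one padded value.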
- class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]
Bases: Module
HiFiGAN Multi-Period Discriminator (MPD). Wrapper that applies the period discriminator (DiscriminatorP) with several different periods. Periods are suggested to be prime numbers to reduce the overlap between the discriminators.
- forward(x)[source]
Returns Multi-Period Discriminator scores and features
- Parameters
x (torch.Tensor (batch, 1, time)) – input waveform.
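The following sketch gives an intuition for why prime periods reduce overlap. It simplifies each sub-discriminator to the set of sample indices picked at phase 0 and measures how much of one set is already covered by another. This is only an illustration of the idea, not how the module computes anything, and the function name is hypothetical.

```python
def shared_fraction(period_a, period_b, n_samples=6000):
    """Fraction of the samples picked with period_b that period_a also picks."""
    picked_a = set(range(0, n_samples, period_a))
    picked_b = set(range(0, n_samples, period_b))
    return len(picked_a & picked_b) / len(picked_b)

nested = shared_fraction(2, 4)    # 4 is a multiple of 2
coprime = shared_fraction(2, 3)   # 2 and 3 share no factor
primes = shared_fraction(5, 7)    # two larger primes
```

With periods 2 and 4, every sample seen at period 4 is also seen at period 2, so the second view adds nothing new at that phase. With coprime periods the shared fraction drops to one over the other period.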
- class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]
Bases: Module
HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.
- Parameters
use_spectral_norm (bool) – if True switch to spectral norm instead of weight norm.
- forward(x)[source]
- Parameters
x (torch.Tensor (batch, 1, time)) – input waveform.
- class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]
Bases: Module
HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.
- forward(x)[source]
- Parameters
x (torch.Tensor (batch, 1, time)) – input waveform.
- class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]
Bases: Module
HiFiGAN discriminator wrapping MPD and MSD.
Example
>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator = HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8
- forward(x)[source]
Returns a list of lists of features from each layer of each discriminator.
- Parameters
x (torch.Tensor) – input waveform.
- speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]
Computes the Fourier transform of short overlapping windows of the input
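To show what "short overlapping windows" means in practice, here is a deliberately naive magnitude STFT in plain Python. It is a sketch under simplifying assumptions: a periodic Hann window, no centering or padding of the signal, `win_length <= n_fft`, and a direct DFT instead of an FFT. The library function may differ on each of these points, and the function name is hypothetical.

```python
import cmath
import math

def stft_magnitudes(signal, n_fft, hop_length, win_length):
    """Magnitude STFT via a direct DFT; returns frames x (n_fft // 2 + 1) bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_length)
              for n in range(win_length)]
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = []
    for t in range(n_frames):
        start = t * hop_length  # consecutive frames overlap when hop < win
        windowed = [signal[start + n] * window[n] for n in range(win_length)]
        windowed += [0.0] * (n_fft - win_length)  # zero-pad up to n_fft
        spectrum = []
        for k in range(n_fft // 2 + 1):
            acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                      for n in range(n_fft))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# A sinusoid with exactly 2 cycles per 16-sample window.
signal = [math.sin(2 * math.pi * 2 * n / 16) for n in range(32)]
mags = stft_magnitudes(signal, n_fft=16, hop_length=4, win_length=16)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in mags]
```

Every frame peaks at frequency bin 2, and the hop of 4 samples against a 16-sample window means consecutive frames share 12 samples.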
- class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]
Bases: Module
STFT loss. The generated and real waveforms are converted to spectrograms, which are compared using L1 and spectral convergence losses. It comes from the Parallel WaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf
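The two terms named above can be sketched on small magnitude matrices, following the definitions in the Parallel WaveGAN paper: spectral convergence is the Frobenius norm of the magnitude error divided by the Frobenius norm of the real magnitudes, and the L1 term is the mean absolute difference of log magnitudes. The function names are hypothetical, nested lists stand in for tensors, and all magnitudes are assumed to be strictly positive.

```python
import math

def spectral_convergence(real_mag, fake_mag):
    """Frobenius norm of the error over the Frobenius norm of the real magnitudes."""
    error = sum((r - f) ** 2
                for real_row, fake_row in zip(real_mag, fake_mag)
                for r, f in zip(real_row, fake_row))
    reference = sum(r ** 2 for real_row in real_mag for r in real_row)
    return math.sqrt(error) / math.sqrt(reference)

def log_magnitude_l1(real_mag, fake_mag):
    """Mean absolute difference between log magnitudes."""
    diffs = [abs(math.log(r) - math.log(f))
             for real_row, fake_row in zip(real_mag, fake_mag)
             for r, f in zip(real_row, fake_row)]
    return sum(diffs) / len(diffs)

real_mag = [[1.0, 2.0], [3.0, 4.0]]
fake_mag = [[1.0, 2.0], [3.0, 5.0]]
sc = spectral_convergence(real_mag, fake_mag)
mag = log_magnitude_l1(real_mag, fake_mag)
```

Spectral convergence is dominated by large-magnitude bins, while the log-magnitude term gives small-magnitude bins more influence, which is why the two are used together.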
- class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]
Bases: Module
Multi-scale STFT loss. The generated and real waveforms are converted to spectrograms at several STFT resolutions, which are compared using L1 and spectral convergence losses. It comes from the Parallel WaveGAN paper: https://arxiv.org/pdf/1910.11480.pdf
- class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]
Bases: Module
L1 loss over spectrograms, as described in the HiFiGAN paper: https://arxiv.org/pdf/2010.05646.pdf Note: L1 loss helps with learning details compared with L2 loss.
- Parameters
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mel_channels (int) – Number of mel filterbanks.
mel_fmin (float) – Minimum frequency.
mel_fmax (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
mel_normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band.
mel_scale (str) – Scale to use: “htk” or “slaney”.
dynamic_range_compression (bool) – Whether to apply dynamic range compression.
- class speechbrain.lobes.models.HifiGAN.MSEGLoss[source]
Bases: Module
Mean Squared Generator Loss. The generator is trained to fool the discriminator by improving sample quality until its outputs are classified with a value close to 1.
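In least-squares GAN form, this objective is the mean of (D(G(s)) - 1)^2 over the discriminator scores for generated samples. The sketch below shows only that arithmetic on a flat list of scores; how the class reduces over multiple sub-discriminators is not shown here, and the function name is hypothetical.

```python
def mse_generator_loss(fake_scores):
    """Mean of (score - 1)^2 over discriminator scores for generated samples."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)

fooled = mse_generator_loss([1.0, 1.0])      # discriminator fully fooled
half_fooled = mse_generator_loss([0.0, 1.0])  # one sample detected as fake
```

The loss reaches zero only when every generated sample is scored as 1, that is, as real.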
- class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]
Bases: Module
Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
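A minimal sketch of a feature matching loss, assuming the features arrive as one list per sub-discriminator, each holding one flattened feature map per layer. The L1 distance per layer follows the description above; averaging over layers is an assumption made here for illustration, and the function name is hypothetical.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Average per-layer L1 distance between real and generated feature maps."""
    total, n_layers = 0.0, 0
    for real_layers, fake_layers in zip(real_feats, fake_feats):
        for real_map, fake_map in zip(real_layers, fake_layers):
            # Mean absolute difference within one layer's feature map.
            total += sum(abs(r - f) for r, f in zip(real_map, fake_map)) / len(real_map)
            n_layers += 1
    return total / n_layers

# Two sub-discriminators: the first with two layers, the second with one.
real_feats = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]]]
fake_feats = [[[1.0, 2.0], [3.0, 6.0]], [[1.0, 1.0]]]
fm_loss = feature_matching_loss(real_feats, fake_feats)
```

Because the distance is measured in the discriminator's feature space rather than on raw samples, the metric changes as the discriminator learns, which is what "learned similarity metric" refers to.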
- class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]
Bases: Module
Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples as 1 and samples synthesized by the generator as 0.
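The targets described above correspond to the least-squares GAN discriminator objective: mean (D(x) - 1)^2 on real samples plus mean D(G(s))^2 on generated ones. The sketch below works on flat lists of scores; the function name and the tuple it returns are hypothetical and are not the class's interface.

```python
def mse_discriminator_loss(real_scores, fake_scores):
    """Return (total, real_term, fake_term) for targets 1 (real) and 0 (fake)."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term, real_term, fake_term

perfect = mse_discriminator_loss([1.0, 1.0], [0.0, 0.0])
undecided = mse_discriminator_loss([0.5], [0.5])
```

A discriminator that outputs 0.5 for everything pays 0.25 on each term, so it is penalized equally for missing real and generated samples.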
- class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0)[source]
Bases: Module
Creates a summary of generator losses and applies weights for different losses
- Parameters
stft_loss (object) – object of stft loss
stft_loss_weight (float) – weight of STFT loss
mseg_loss (object) – object of mseg loss
mseg_loss_weight (float) – weight of mseg loss
feat_match_loss (object) – object of feature match loss
feat_match_loss_weight (float) – weight of feature match loss
l1_spec_loss (object) – object of L1 spectrogram loss
l1_spec_loss_weight (float) – weight of L1 spectrogram loss
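The weighting described by these parameters amounts to a weighted sum of the enabled loss terms. The sketch below is a simplification: it treats a weight of 0 as "disabled", whereas the class may decide based on whether a loss object was passed. The per-term values are hypothetical, and the weights are illustrative (the HiFi-GAN paper reports 2 for feature matching and 45 for the mel-spectrogram term).

```python
def weighted_generator_loss(loss_values, loss_weights):
    """Sum weight * value over the terms whose weight is non-zero."""
    total = 0.0
    for name, value in loss_values.items():
        weight = loss_weights.get(name, 0)
        if weight:
            total += weight * value
    return total

# Hypothetical per-term values for one training step.
loss_values = {"stft": 0.5, "mseg": 0.8, "feat_match": 0.1, "l1_spec": 0.02}
# Illustrative weights; the STFT term is disabled by leaving its weight at 0.
loss_weights = {"stft": 0, "mseg": 1, "feat_match": 2, "l1_spec": 45}
total = weighted_generator_loss(loss_values, loss_weights)
```

The large weight on the spectrogram term compensates for its small raw magnitude, so it contributes on the same order as the adversarial term.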