speechbrain.lobes.models.HifiGAN module

Neural network modules for the HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

For more details: https://arxiv.org/pdf/2010.05646.pdf, https://arxiv.org/abs/2406.10735

Authors

Jarod Duret 2021
Yingzhi WANG 2022

Summary

Classes:

`DiscriminatorLoss`	Creates a summary of discriminator losses
`DiscriminatorP`	HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions. Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat.
`DiscriminatorS`	HiFiGAN Scale Discriminator.
`GeneratorLoss`	Creates a summary of generator losses and applies weights for different losses
`HifiganDiscriminator`	HiFiGAN discriminator wrapping MPD and MSD.
`HifiganGenerator`	HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)
`HingeDLoss`	Hinge Discriminator Loss.
`HingeGLoss`	Hinge Generator Loss.
`L1SpecLoss`	L1 loss over spectrograms, as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: L1 loss helps learning details compared with L2 loss.
`MSEDLoss`	Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.
`MSEGLoss`	Mean Squared Generator Loss. The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.
`MelganFeatureLoss`	Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).
`MultiPeriodDiscriminator`	HiFiGAN Multi-Period Discriminator (MPD). Wrapper for the `PeriodDiscriminator` to apply it in different periods.
`MultiScaleDiscriminator`	HiFiGAN Multi-Scale Discriminator.
`MultiScaleSTFTLoss`	Multi-scale STFT loss.
`ResBlock1`	Residual Block Type 1, which has 3 convolutional layers in each convolution block.
`ResBlock2`	Residual Block Type 2, which has 2 convolutional layers in each convolution block.
`STFTLoss`	STFT loss.
`UnitHifiganGenerator`	The UnitHiFiGAN generator takes discrete speech tokens as input.
`VariancePredictor`	Variance predictor inspired from FastSpeech2

Functions:

`dynamic_range_compression`	Dynamic range compression for audio signals
`mel_spectogram`	Calculates the mel spectrogram of a raw audio signal
`process_duration`	Process a given batch of code to extract consecutive unique elements and their associated features.
`stft`	computes the Fourier transform of short overlapping windows of the input

Reference

speechbrain.lobes.models.HifiGAN.dynamic_range_compression(x, C=1, clip_val=1e-05)[source]: Dynamic range compression for audio signals
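
This page does not spell out the formula. The sketch below assumes the formulation commonly used in HiFi-GAN code, `log(clamp(x, min=clip_val) * C)`, and works on plain Python floats rather than tensors. The name `compress` is hypothetical, and the body is an assumption rather than the library's confirmed implementation.

```python
import math

def compress(values, C=1, clip_val=1e-5):
    """Log-compress magnitudes after clipping them from below (assumed formula)."""
    # Clipping avoids log(0) for silent bins; C rescales before the log.
    return [math.log(max(v, clip_val) * C) for v in values]

compressed = compress([0.0, 1.0, 10.0])
```

With these defaults a magnitude of 1.0 maps to 0.0, and anything at or below `clip_val` maps to the same floor value.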

speechbrain.lobes.models.HifiGAN.mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, compression, audio)[source]

Calculates the mel spectrogram of a raw audio signal.

Parameters:

sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
compression (bool) – whether to do dynamic range compression
audio (torch.tensor) – input audio signal

Return type:

Mel spectrogram

speechbrain.lobes.models.HifiGAN.process_duration(code, code_feat)[source]

Process a given batch of code to extract consecutive unique elements and their associated features.

Parameters:

code (torch.Tensor (batch, time)) – Tensor of code indices.
code_feat (torch.Tensor (batch, time, channel)) – Tensor of code features.

Returns:

uniq_code_feat_filtered (torch.Tensor (batch, time, channel)) – Features of consecutive unique codes.
mask (torch.Tensor (batch, time)) – Padding mask for the unique codes.
uniq_code_count (torch.Tensor (n)) – Count of unique codes.

Example

>>> code = torch.IntTensor([[40, 18, 18, 10]])
>>> code_feat = torch.rand([1, 4, 128])
>>> out_tensor, mask, uniq_code = process_duration(code, code_feat)
>>> out_tensor.shape
torch.Size([1, 1, 128])
>>> mask.shape
torch.Size([1, 1])
>>> uniq_code.shape
torch.Size([1])
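
The "consecutive unique elements" step can be illustrated without tensors. This is a rough pure-Python sketch of run-length grouping only; `run_lengths` is a hypothetical helper, and it does not reproduce whatever filtering and padding the function applies to produce the shapes and mask shown above.

```python
from itertools import groupby

def run_lengths(codes):
    """Collapse consecutive repeats into (code, count) pairs."""
    return [(code, sum(1 for _ in group)) for code, group in groupby(codes)]

# Same code sequence as the example above: 18 appears twice in a row.
runs = run_lengths([40, 18, 18, 10])
```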

class speechbrain.lobes.models.HifiGAN.ResBlock1(channels, kernel_size=3, dilation=(1, 3, 5))[source]

Bases: Module

Residual Block Type 1, which has 3 convolutional layers in each convolution block.

Parameters:

channels (int) – number of hidden channels for the convolutional layers.
kernel_size (int) – size of the convolution filter in each layer.
dilation (list) – list of dilation values for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock1

Parameters:

x (torch.Tensor (batch, channel, time)) – input tensor.

Return type:

The ResBlock outputs

remove_weight_norm()[source]: Removes weight normalization for inference.

class speechbrain.lobes.models.HifiGAN.ResBlock2(channels, kernel_size=3, dilation=(1, 3))[source]

Bases: Module

Residual Block Type 2, which has 2 convolutional layers in each convolution block.

Parameters:

channels (int) – number of hidden channels for the convolutional layers.
kernel_size (int) – size of the convolution filter in each layer.
dilation (list) – list of dilation values for each conv layer in a block.

forward(x)[source]

Returns the output of ResBlock2

Parameters:

x (torch.Tensor (batch, channel, time)) – input tensor.

Return type:

The ResBlock outputs

remove_weight_norm()[source]: Removes weight normalization for inference.

class speechbrain.lobes.models.HifiGAN.HifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True)[source]

Bases: Module

HiFiGAN Generator with Multi-Receptive Field Fusion (MRF)

Parameters:

in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
cond_channels (int) – If provided, adds a conv layer to the beginning of the forward.
conv_post_bias (bool) – Whether to add a bias term to the final conv.

Example

>>> inp_tensor = torch.rand([4, 80, 33])
>>> hifigan_generator = HifiganGenerator(
...     in_channels=80,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[16, 16, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[8, 8, 2, 2],
... )
>>> out_tensor = hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 8448])
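
The output length in the example above can be checked by hand: 33 input frames times the product of the upsampling factors (8 × 8 × 2 × 2 = 256) gives 8448 samples. A small sketch of that arithmetic, assuming each upsampling layer scales the time axis by exactly its stride, as the example shapes suggest; `expected_output_length` is a hypothetical helper, not part of the module.

```python
import math

def expected_output_length(num_frames, upsample_factors):
    """Frames times the product of strides (assumes exact upsampling per layer)."""
    return num_frames * math.prod(upsample_factors)

length = expected_output_length(33, [8, 8, 2, 2])
```

The product of `upsample_factors` is normally chosen to equal the hop length of the input features, so that one feature frame expands to one hop of audio.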

forward(x, g=None)[source]

Parameters:

x (torch.Tensor (batch, channel, time)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.

Return type:

The generator outputs

remove_weight_norm()[source]: Removes weight normalization for inference.

inference(c, padding=True)[source]

The inference function performs a padding and runs the forward method.

Parameters:

c (torch.Tensor (batch, channel, time)) – feature input tensor.
padding (bool) – Whether to pad tensor before forward.

Return type:

The generator outputs

class speechbrain.lobes.models.HifiGAN.VariancePredictor(encoder_embed_dim, var_pred_hidden_dim, var_pred_kernel_size, var_pred_dropout)[source]

Bases: Module

Variance predictor inspired from FastSpeech2

Parameters:

encoder_embed_dim (int) – number of input tensor channels.
var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers.
var_pred_kernel_size (int) – size of the convolution filter in each layer.
var_pred_dropout (float) – dropout probability of each layer.

Example

>>> inp_tensor = torch.rand([4, 80, 128])
>>> duration_predictor = VariancePredictor(
...     encoder_embed_dim=128,
...     var_pred_hidden_dim=128,
...     var_pred_kernel_size=3,
...     var_pred_dropout=0.5,
... )
>>> out_tensor = duration_predictor(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 80])

forward(x)[source]

Parameters:

x (torch.Tensor (batch, time, channel)) – feature input tensor; the last dimension matches encoder_embed_dim, as in the example above.

Return type:

Variance predictor output

class speechbrain.lobes.models.HifiGAN.UnitHifiganGenerator(in_channels, out_channels, resblock_type, resblock_dilation_sizes, resblock_kernel_sizes, upsample_kernel_sizes, upsample_initial_channel, upsample_factors, inference_padding=5, cond_channels=0, conv_post_bias=True, vocab_size=100, embedding_dim=128, attn_dim=128, duration_predictor=False, var_pred_hidden_dim=128, var_pred_kernel_size=3, var_pred_dropout=0.5, multi_speaker=False, normalize_speaker_embeddings=False, skip_token_embedding=False, pooling_type='attention')[source]

Bases: HifiganGenerator

The UnitHiFiGAN generator takes discrete speech tokens as input. The generator is adapted to support bitrate scalability training. For more details, refer to: https://arxiv.org/abs/2406.10735.

Parameters:

in_channels (int) – number of input tensor channels.
out_channels (int) – number of output tensor channels.
resblock_type (str) – type of the ResBlock. ‘1’ or ‘2’.
resblock_dilation_sizes (List[List[int]]) – list of dilation values in each layer of a ResBlock.
resblock_kernel_sizes (List[int]) – list of kernel sizes for each ResBlock.
upsample_kernel_sizes (List[int]) – list of kernel sizes for each transposed convolution.
upsample_initial_channel (int) – number of channels for the first upsampling layer. This is divided by 2 for each consecutive upsampling layer.
upsample_factors (List[int]) – upsampling factors (stride) for each upsampling layer.
inference_padding (int) – constant padding applied to the input at inference time. Defaults to 5.
cond_channels (int) – If provided, adds a conv layer to the beginning of the forward.
conv_post_bias (bool) – Whether to add a bias term to the final conv.
vocab_size (int) – size of the dictionary of embeddings.
embedding_dim (int) – size of each embedding vector.
attn_dim (int) – size of attention dimension.
duration_predictor (bool) – enable duration predictor module.
var_pred_hidden_dim (int) – size of hidden channels for the convolutional layers of the duration predictor.
var_pred_kernel_size (int) – size of the convolution filter in each layer of the duration predictor.
var_pred_dropout (float) – dropout probability of each layer in the duration predictor.
multi_speaker (bool) – enable multi speaker training.
normalize_speaker_embeddings (bool) – enable normalization of speaker embeddings.
skip_token_embedding (bool) – Whether to skip the embedding layer in the case of continuous input.
pooling_type (str, optional) – The type of pooling to use. Must be one of [“attention”, “sum”, “none”]. Defaults to “attention” for scalable vocoder.

Example

>>> inp_tensor = torch.randint(0, 100, (4, 10, 1))
>>> unit_hifigan_generator = UnitHifiganGenerator(
...     in_channels=128,
...     out_channels=1,
...     resblock_type="1",
...     resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
...     resblock_kernel_sizes=[3, 7, 11],
...     upsample_kernel_sizes=[11, 8, 8, 4, 4],
...     upsample_initial_channel=512,
...     upsample_factors=[5, 4, 4, 2, 2],
...     vocab_size=100,
...     embedding_dim=128,
...     duration_predictor=True,
...     var_pred_hidden_dim=128,
...     var_pred_kernel_size=3,
...     var_pred_dropout=0.5,
... )
>>> out_tensor, _ = unit_hifigan_generator(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 1, 3200])

forward(x, g=None, spk=None)[source]

Parameters:

x (torch.Tensor (batch, time, channel)) – feature input tensor.
g (torch.Tensor (batch, 1, time)) – global conditioning input tensor.
spk (torch.Tensor) – Speaker embeddings

Return type:

Generator output

inference(x, spk=None)[source]

The inference function performs duration prediction and runs the forward method.

Parameters:

x (torch.Tensor (batch, time, channel)) – feature input tensor.
spk (torch.Tensor) – Speaker embeddings

Return type:

Generator output

class speechbrain.lobes.models.HifiGAN.DiscriminatorP(period, kernel_size=5, stride=3)[source]

Bases: Module

HiFiGAN Periodic Discriminator. Takes every Pth value from the input waveform and applies a stack of convolutions.

Note: if period is 2, waveform = [1, 2, 3, 4, 5, 6 ...] --> [1, 3, 5 ...] --> convs --> score, feat

Parameters:

period (int) – Sampling period: one value is taken every `period` samples.
kernel_size (int) – Size of 1-d kernel for conv stack
stride (int) – Stride of conv stack
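
The "every Pth value" idea can be sketched in pure Python. The HiFi-GAN paper describes reshaping the 1-D waveform into a 2-D grid of width `period`, so that each column holds samples spaced `period` apart. The helper `split_by_period` below is hypothetical, and the zero padding it uses is an assumption; the padding mode of the actual module is not stated on this page.

```python
def split_by_period(waveform, period, pad_value=0):
    """Return `period` sub-sequences, each holding every `period`-th sample."""
    # Pad so the length is a multiple of the period (padding mode is an assumption).
    remainder = len(waveform) % period
    if remainder:
        waveform = list(waveform) + [pad_value] * (period - remainder)
    # Sub-sequence k starts at offset k and steps by `period`.
    return [list(waveform[k::period]) for k in range(period)]

columns = split_by_period([1, 2, 3, 4, 5, 6], period=2)
```

For period 2 the first sub-sequence is `[1, 3, 5]`, matching the note above; the convolution stack then sees all sub-sequences side by side.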

forward(x)[source]

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.MultiPeriodDiscriminator[source]

Bases: Module

HiFiGAN Multi-Period Discriminator (MPD) Wrapper for the PeriodDiscriminator to apply it in different periods. Periods are suggested to be prime numbers to reduce the overlap between each discriminator.

forward(x)[source]

Returns Multi-Period Discriminator scores and features

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.DiscriminatorS(use_spectral_norm=False)[source]

Bases: Module

HiFiGAN Scale Discriminator. It is similar to MelganDiscriminator but with a specific architecture explained in the paper. SpeechBrain CNN wrappers are not used here because spectral_norm is not often used.

Parameters:

use_spectral_norm (bool) – if True, use spectral norm instead of weight norm.

forward(x)[source]

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.MultiScaleDiscriminator[source]

Bases: Module

HiFiGAN Multi-Scale Discriminator. Similar to MultiScaleMelganDiscriminator but specially tailored for HiFiGAN as in the paper.

forward(x)[source]

Parameters:

x (torch.Tensor (batch, 1, time)) – input waveform.

Return type:

Scores and features

class speechbrain.lobes.models.HifiGAN.HifiganDiscriminator[source]

Bases: Module

HiFiGAN discriminator wrapping MPD and MSD.

Example

>>> inp_tensor = torch.rand([4, 1, 8192])
>>> hifigan_discriminator = HifiganDiscriminator()
>>> scores, feats = hifigan_discriminator(inp_tensor)
>>> len(scores)
8
>>> len(feats)
8
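
The count of 8 in the example is consistent with the configuration in the HiFi-GAN paper: five period sub-discriminators (periods 2, 3, 5, 7, 11) plus three scale sub-discriminators. The snippet below only does that arithmetic; that this module uses exactly those periods and scales is an assumption here, not something this page states.

```python
# Periods from the HiFi-GAN paper (assumed to match this module's defaults).
mpd_periods = [2, 3, 5, 7, 11]
# The paper's MSD uses three scales: raw audio plus two average-pooled versions.
msd_num_scales = 3

num_sub_discriminators = len(mpd_periods) + msd_num_scales
```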

forward(x)[source]

Returns a list of lists of features from each layer of each discriminator.

Parameters:

x (torch.Tensor) – input waveform.

Return type:

Features from each discriminator layer

speechbrain.lobes.models.HifiGAN.stft(x, n_fft, hop_length, win_length, window_fn='hann_window')[source]: computes the Fourier transform of short overlapping windows of the input

class speechbrain.lobes.models.HifiGAN.STFTLoss(n_fft, hop_length, win_length)[source]

Bases: Module

STFT loss. Generated and real input waveforms are converted to spectrograms and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

Parameters:

n_fft (int) – size of Fourier transform.
hop_length (int) – the distance between neighboring sliding window frames.
win_length (int) – the size of window frame and STFT filter.

forward(y_hat, y)[source]

Returns magnitude loss and spectral convergence loss

Parameters:

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor

Return type:

Magnitude loss and spectral convergence loss
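
For reference, the two terms as defined in the Parallel WaveGAN paper are the spectral convergence (Frobenius norm of the magnitude error divided by the Frobenius norm of the reference magnitude) and the mean L1 distance between log magnitudes. The sketch below computes both on small, precomputed magnitude matrices in pure Python. The function names are hypothetical, and the exact reduction used by this class may differ.

```python
import math

def spectral_convergence(mag_hat, mag):
    """Frobenius norm of the magnitude error, relative to the reference norm."""
    num = math.sqrt(sum((a - b) ** 2
                        for ra, rb in zip(mag, mag_hat)
                        for a, b in zip(ra, rb)))
    den = math.sqrt(sum(a ** 2 for row in mag for a in row))
    return num / den

def log_magnitude_l1(mag_hat, mag):
    """Mean absolute difference between log magnitudes (inputs must be > 0)."""
    diffs = [abs(math.log(a) - math.log(b))
             for ra, rb in zip(mag, mag_hat)
             for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

reference = [[1.0, 2.0], [3.0, 4.0]]  # toy |STFT(y)|, frames x bins
generated = [[1.0, 2.0], [3.0, 2.0]]  # toy |STFT(y_hat)|
sc = spectral_convergence(generated, reference)
mag = log_magnitude_l1(generated, reference)
```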

class speechbrain.lobes.models.HifiGAN.MultiScaleSTFTLoss(n_ffts=(1024, 2048, 512), hop_lengths=(120, 240, 50), win_lengths=(600, 1200, 240))[source]

Bases: Module

Multi-scale STFT loss. Generated and real input waveforms are converted to spectrograms at several STFT resolutions and compared using L1 and spectral convergence losses. It is from the ParallelWaveGAN paper https://arxiv.org/pdf/1910.11480.pdf

forward(y_hat, y)[source]

Returns multi-scale magnitude loss and spectral convergence loss

Parameters:

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor

Return type:

Magnitude loss and spectral convergence loss

class speechbrain.lobes.models.HifiGAN.L1SpecLoss(sample_rate=22050, hop_length=256, win_length=24, n_mel_channels=80, n_fft=1024, n_stft=513, mel_fmin=0.0, mel_fmax=8000.0, mel_normalized=False, power=1.0, norm='slaney', mel_scale='slaney', dynamic_range_compression=True)[source]

Bases: Module

L1 loss over spectrograms, as described in the HiFiGAN paper https://arxiv.org/pdf/2010.05646.pdf. Note: L1 loss helps learning details compared with L2 loss.

Parameters:

sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_mel_channels (int) – Number of mel filterbanks.
n_fft (int) – Size of FFT.
n_stft (int) – Size of STFT.
mel_fmin (float) – Minimum frequency.
mel_fmax (float) – Maximum frequency.
mel_normalized (bool) – Whether to normalize by magnitude after stft.
power (float) – Exponent for the magnitude spectrogram.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
dynamic_range_compression (bool) – whether to do dynamic range compression
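
The defaults `n_fft=1024` and `n_stft=513` are related by the usual one-sided STFT bin count for real-valued signals, `n_fft // 2 + 1`. A tiny sketch of that relation; `one_sided_bins` is a hypothetical helper, and whether the class derives `n_stft` itself or expects the caller to keep the two in sync is not stated on this page.

```python
def one_sided_bins(n_fft):
    """Number of non-redundant frequency bins for a real-valued signal."""
    return n_fft // 2 + 1

n_stft = one_sided_bins(1024)
```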

forward(y_hat, y)[source]

Returns L1 Loss over Spectrograms

Parameters:

y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor

Return type:

L1 loss

class speechbrain.lobes.models.HifiGAN.MSEGLoss(*args, **kwargs)[source]

Bases: Module

Mean Squared Generator Loss. The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.

forward(score_fake)[source]

Returns Generator GAN loss

Parameters:

score_fake (list) – discriminator scores of generated waveforms D(G(s))

Return type:

Generator loss
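
The description above corresponds to the least-squares GAN objective, in which fake scores are pulled toward the target 1. A minimal sketch on a flat list of floats, assuming the form mean((s − 1)²). The name `mse_generator_loss` is hypothetical, and how the class aggregates scores across sub-discriminators is not shown here.

```python
def mse_generator_loss(scores_fake):
    """Mean squared distance of fake scores from the 'real' target 1 (assumed form)."""
    return sum((s - 1.0) ** 2 for s in scores_fake) / len(scores_fake)

loss = mse_generator_loss([1.0, 0.0, 0.5])
```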

class speechbrain.lobes.models.HifiGAN.HingeGLoss(*args, **kwargs)[source]

Bases: Module

Hinge Generator Loss.

The generator is trained to fake the discriminator by updating the sample quality to be classified to a value almost equal to 1.

Example

>>> import torch
>>> score_fake = torch.randn(4, 88)
>>> loss = HingeGLoss()(score_fake)
>>> print(loss)

forward(score_fake)[source]

Returns Generator GAN loss

Parameters:

score_fake (torch.Tensor) – Discriminator scores of generated waveforms D(G(s))

Return type:

Generator loss
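
One hinge form that matches the description (the loss vanishes once fake scores reach 1) is mean(max(0, 1 − s)). The sketch below assumes that form on a flat list of floats, and `hinge_generator_loss` is a hypothetical name. Other hinge-GAN variants use −mean(s) for the generator, and this page does not say which one the class implements.

```python
def hinge_generator_loss(scores_fake):
    """Mean of max(0, 1 - s): zero once every fake score reaches 1 (assumed form)."""
    return sum(max(0.0, 1.0 - s) for s in scores_fake) / len(scores_fake)

loss = hinge_generator_loss([2.0, 1.0, 0.0, -1.0])
```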

class speechbrain.lobes.models.HifiGAN.MelganFeatureLoss[source]

Bases: Module

Calculates the feature matching loss, which is a learned similarity metric measured by the difference in features of the discriminator between a ground truth sample and a generated sample (Larsen et al., 2016, Kumar et al., 2019).

forward(fake_feats, real_feats)[source]

Returns feature matching loss

Parameters:

fake_feats (list) – discriminator features of generated waveforms
real_feats (list) – discriminator features of groundtruth waveforms

Return type:

Feature matching loss
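
Feature matching can be sketched as an average L1 distance between corresponding feature maps, taken over every layer of every sub-discriminator. The pure-Python version below uses flattened toy feature lists. `feature_matching_loss` is a hypothetical helper, and the per-layer mean followed by an average over layers is an assumption about the reduction.

```python
def feature_matching_loss(fake_feats, real_feats):
    """Average per-layer mean absolute difference over all discriminators."""
    total, num_layers = 0.0, 0
    for fake_layers, real_layers in zip(fake_feats, real_feats):
        for fake, real in zip(fake_layers, real_layers):
            total += sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)
            num_layers += 1
    return total / num_layers

# Two discriminators; each inner list is one layer's flattened feature map.
fake = [[[0.0, 1.0], [2.0, 2.0]], [[1.0, 1.0]]]
real = [[[0.0, 3.0], [2.0, 2.0]], [[0.0, 1.0]]]
fm_loss = feature_matching_loss(fake, real)
```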

class speechbrain.lobes.models.HifiGAN.MSEDLoss[source]

Bases: Module

Mean Squared Discriminator Loss. The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters:

score_fake (list) – discriminator scores of generated waveforms
score_real (list) – discriminator scores of groundtruth waveforms

Return type:

Discriminator losses
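
The targets described above (1 for real, 0 for generated) give the least-squares discriminator objective. A minimal sketch on flat lists of floats, assuming mean((real − 1)²) + mean(fake²). The function name and the returned tuple layout are hypothetical.

```python
def mse_discriminator_loss(scores_fake, scores_real):
    """Push real scores toward 1 and fake scores toward 0 (assumed LSGAN form)."""
    loss_real = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    loss_fake = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake, loss_real, loss_fake

total, loss_real, loss_fake = mse_discriminator_loss([0.0, 0.5], [1.0, 0.5])
```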

class speechbrain.lobes.models.HifiGAN.HingeDLoss(*args, **kwargs)[source]

Bases: Module

Hinge Discriminator Loss.

The discriminator is trained to classify ground truth samples to 1, and the samples synthesized from the generator to 0.

Example

>>> import torch
>>> score_fake = torch.randn(4, 88)
>>> score_real = torch.randn(4, 88)
>>> loss = HingeDLoss()(score_fake, score_real)
>>> print(loss)

forward(score_fake, score_real)[source]

Returns Discriminator GAN losses

Parameters:

score_fake (torch.Tensor) – discriminator scores of generated waveforms
score_real (torch.Tensor) – discriminator scores of groundtruth waveforms

Return type:

Discriminator losses
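
The standard hinge discriminator objective penalizes real scores that fall below +1 and fake scores that rise above −1. The sketch below assumes that standard form on flat lists of floats, and `hinge_discriminator_loss` is a hypothetical name. Note that these ±1 margins differ from the 1/0 targets mentioned in the description, so the margins are an assumption rather than confirmed behavior of this class.

```python
def hinge_discriminator_loss(scores_fake, scores_real):
    """Penalize real scores below +1 and fake scores above -1 (assumed form)."""
    loss_real = sum(max(0.0, 1.0 - s) for s in scores_real) / len(scores_real)
    loss_fake = sum(max(0.0, 1.0 + s) for s in scores_fake) / len(scores_fake)
    return loss_real + loss_fake

loss = hinge_discriminator_loss(scores_fake=[-2.0, 0.0], scores_real=[2.0, 0.0])
```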

class speechbrain.lobes.models.HifiGAN.GeneratorLoss(stft_loss=None, stft_loss_weight=0, mseg_loss=None, mseg_loss_weight=0, feat_match_loss=None, feat_match_loss_weight=0, l1_spec_loss=None, l1_spec_loss_weight=0, mseg_dur_loss=None, mseg_dur_loss_weight=0)[source]

Bases: Module

Creates a summary of generator losses and applies weights for different losses

Parameters:

stft_loss (object) – object of stft loss
stft_loss_weight (float) – weight of STFT loss
mseg_loss (object) – object of mseg loss
mseg_loss_weight (float) – weight of mseg loss
feat_match_loss (object) – object of feature match loss
feat_match_loss_weight (float) – weight of feature match loss
l1_spec_loss (object) – object of L1 spectrogram loss
l1_spec_loss_weight (float) – weight of L1 spectrogram loss
mseg_dur_loss (object) – object of mseg duration loss
mseg_dur_loss_weight (float) – weight of mseg duration loss

forward(stage, y_hat=None, y=None, scores_fake=None, feats_fake=None, feats_real=None, log_dur_pred=None, log_dur=None)[source]

Returns a dictionary of generator losses and applies weights

Parameters:

stage (speechbrain.Stage) – training, validation or testing
y_hat (torch.tensor) – generated waveform tensor
y (torch.tensor) – real waveform tensor
scores_fake (list) – discriminator scores of generated waveforms
feats_fake (list) – discriminator features of generated waveforms
feats_real (list) – discriminator features of groundtruth waveforms
log_dur_pred (torch.Tensor) – Predicted duration for duration loss
log_dur (torch.Tensor) – Real duration for duration loss

Return type:

Dictionary of generator losses
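
The weighting step can be pictured as scaling each enabled loss term by its weight and summing the results. The sketch below shows that idea with plain floats. The dictionary keys, the example weights, and the `combine_losses` helper are all hypothetical and do not reflect the key names this class actually returns.

```python
def combine_losses(losses, weights):
    """Return per-term weighted losses plus their sum under a 'total' key."""
    weighted = {name: weights[name] * value for name, value in losses.items()}
    weighted["total"] = sum(weighted.values())
    return weighted

# Hypothetical raw loss values and weights, for illustration only.
raw = {"stft": 2.0, "mseg": 0.5, "feat_match": 1.0, "l1_spec": 0.25}
weights = {"stft": 0.0, "mseg": 1.0, "feat_match": 10.0, "l1_spec": 45.0}
summary = combine_losses(raw, weights)
```

A weight of 0 removes a term from the total, which mirrors the default weights of 0 in the constructor signature.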

class speechbrain.lobes.models.HifiGAN.DiscriminatorLoss(msed_loss=None)[source]

Bases: Module

Creates a summary of discriminator losses

Parameters:

msed_loss (object) – object of MSE discriminator loss

forward(scores_fake, scores_real)[source]

Returns a dictionary of discriminator losses

Parameters:

scores_fake (list) – discriminator scores of generated waveforms
scores_real (list) – discriminator scores of groundtruth waveforms

Return type:

Dictionary of discriminator losses