speechbrain.lobes.models.DiffWave module

Neural network modules for DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS

For more details: https://arxiv.org/pdf/2009.09761.pdf

Authors
  • Yingzhi WANG 2022

Summary

Classes:

DiffWave

DiffWave Model with dilated residual blocks

DiffWaveDiffusion

An enhanced diffusion implementation with DiffWave-specific inference

DiffusionEmbedding

Embeds the diffusion step into an input vector of DiffWave

ResidualBlock

Residual Block with dilated convolution

SpectrogramUpsampler

Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific Conv that maps the mel bands into 2× residual channels can be found in the residual block.

Functions:

diffwave_mel_spectogram

Calculates the MelSpectrogram for a raw audio signal and preprocesses it for DiffWave training.

Reference

speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)[source]

Calculates the MelSpectrogram for a raw audio signal and preprocesses it for DiffWave training.

Parameters:
  • sample_rate (int) – Sample rate of audio signal.

  • hop_length (int) – Length of hop between STFT windows.

  • win_length (int) – Window size.

  • n_fft (int) – Size of FFT.

  • n_mels (int) – Number of mel filterbanks.

  • f_min (float) – Minimum frequency.

  • f_max (float) – Maximum frequency.

  • power (float) – Exponent for the magnitude spectrogram.

  • normalized (bool) – Whether to normalize by magnitude after stft.

  • norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band (area normalization).

  • mel_scale (str) – Scale to use: “htk” or “slaney”.

  • audio (torch.Tensor) – Input audio signal.
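
A minimal usage sketch; the parameter values below are illustrative assumptions, not defaults taken from this module:

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import diffwave_mel_spectogram
>>> audio = torch.randn(1, 22050)
>>> mel = diffwave_mel_spectogram(
...     sample_rate=22050, hop_length=256, win_length=1024, n_fft=1024,
...     n_mels=80, f_min=0.0, f_max=8000.0, power=1.0, normalized=False,
...     norm="slaney", mel_scale="slaney", audio=audio,
... )

The result is a mel spectrogram of shape [bs, n_mels, frames]; the exact scaling depends on the preprocessing applied inside the function.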

class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)[source]

Bases: Module

Embeds the diffusion step into an input vector of DiffWave

Parameters:

max_steps (int) – Total diffusion steps

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
forward(diffusion_step)[source]

Forward function of the diffusion step embedding

Parameters:

diffusion_step (torch.Tensor) – which step of diffusion to execute

Returns:

diffusion step embedding

Return type:

tensor [bs, 512]

training: bool
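
For intuition: the DiffWave paper (arXiv:2009.09761) builds this embedding from a fixed 128-dimensional sinusoidal encoding of the step index, which learned fully connected layers then project to the 512-dimensional vector shown above. A sketch of the sinusoidal part only, as an illustration of the paper's formulation rather than the exact code of this module:

>>> t = torch.randint(50, (1,))
>>> j = torch.arange(64, dtype=torch.float32)
>>> freqs = 10.0 ** (j * 4.0 / 63.0)   # geometrically spaced frequencies
>>> args = t.float().unsqueeze(-1) * freqs
>>> encoding = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
>>> encoding.shape
torch.Size([1, 128])
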
class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler[source]

Bases: Module

Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific Conv that maps the mel bands into 2× residual channels can be found in the residual block.

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
forward(x)[source]

Upsamples spectrograms 256 times to match the length of the audio. The hop length should be 256 when extracting mel spectrograms.

Parameters:

x (torch.Tensor) – input mel spectrogram [bs, 80, mel_len]

Return type:

upsampled spectrogram [bs, 80, mel_len*256]
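
For intuition about the fixed 256× factor: a hop length of 256 means each mel frame covers 256 audio samples, so a mel of length 100 maps to 25600 samples as in the example above. One common way to realize this is two stride-16 transposed convolutions (16 × 16 = 256); the kernel, padding, and activation values below are assumptions chosen to reproduce that factor, not read from this module's code:

>>> import torch.nn as nn
>>> up = nn.Sequential(
...     nn.ConvTranspose2d(1, 1, (3, 32), stride=(1, 16), padding=(1, 8)),
...     nn.LeakyReLU(0.4),
...     nn.ConvTranspose2d(1, 1, (3, 32), stride=(1, 16), padding=(1, 8)),
...     nn.LeakyReLU(0.4),
... )
>>> mel = torch.rand(3, 80, 100)
>>> up(mel.unsqueeze(1)).squeeze(1).shape
torch.Size([3, 80, 25600])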

training: bool
class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)[source]

Bases: Module

Residual Block with dilated convolution

Parameters:
  • n_mels – input mel channels of conv1x1 for conditional vocoding task

  • residual_channels – channels of audio convolution

  • dilation – dilation cycles of audio convolution

  • uncond – conditional/unconditional generation

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
forward(x, diffusion_step, conditioner=None)[source]

Forward function of the Residual Block

Parameters:
  • x (torch.Tensor) – input sample [bs, 1, time]

  • diffusion_step (torch.Tensor) – the embedding of which step of diffusion to execute

  • conditioner (torch.Tensor) – the condition used for conditional generation

Returns:

  • residual output [bs, residual_channels, time]

  • a skip of residual branch [bs, residual_channels, time]

training: bool
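
The dilated convolution (and, in the conditional case, the mel conditioner Conv mentioned above) produces 2 × residual_channels features, which are split and combined through a WaveNet-style gated-tanh nonlinearity as described in the DiffWave paper. A minimal sketch of that gating, as an illustration of the mechanism rather than the exact code of this module:

>>> h = torch.randn(1, 128, 22050)   # [bs, 2 * residual_channels, time]
>>> gate, filt = torch.chunk(h, 2, dim=1)
>>> gated = torch.sigmoid(gate) * torch.tanh(filt)
>>> gated.shape
torch.Size([1, 64, 22050])
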
class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)[source]

Bases: Module

DiffWave Model with dilated residual blocks

Parameters:
  • input_channels – input mel channels of conv1x1 for conditional vocoding task

  • residual_layers – number of residual blocks

  • residual_channels – channels of audio convolution

  • dilation_cycle_length – dilation cycles of audio convolution

  • total_steps – total steps of diffusion

  • unconditional – conditional/unconditional generation

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
forward(audio, diffusion_step, spectrogram=None, length=None)[source]

DiffWave forward function

Parameters:
  • audio (torch.Tensor) – input gaussian sample [bs, 1, time]

  • diffusion_step (torch.Tensor) – which timestep of diffusion to execute [bs, 1]

  • spectrogram (torch.Tensor) – spectrogram data [bs, 80, mel_len]

  • length (torch.Tensor) – sample lengths (not used; provided for compatibility only)

Return type:

predicted noise [bs, 1, time]

training: bool
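
Since the forward pass returns the predicted noise, a training step amounts to standard denoising-diffusion regression against the injected noise. A minimal illustrative sketch reusing the diffwave instance from the example above; the linear beta schedule and MSE loss here are assumptions for illustration, not the SpeechBrain training recipe:

>>> import torch.nn.functional as F
>>> betas = torch.linspace(1e-4, 0.05, 50)        # assumed linear schedule
>>> alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
>>> clean_audio = torch.randn(1, 1, 25600)        # stand-in for a real waveform batch
>>> mel = torch.rand(1, 80, 100)                  # matching condition (hop length 256)
>>> t = torch.randint(50, (1,))
>>> noise = torch.randn_like(clean_audio)
>>> a_bar = alphas_cumprod[t].view(-1, 1, 1)
>>> noisy_audio = a_bar.sqrt() * clean_audio + (1 - a_bar).sqrt() * noise
>>> loss = F.mse_loss(diffwave(noisy_audio, t, mel), noise)
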
class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]

Bases: DenoisingDiffusion

An enhanced diffusion implementation with DiffWave-specific inference

Parameters:
  • model (nn.Module) – the underlying model

  • timesteps (int) – the total number of timesteps

  • noise (str|nn.Module) – the type of noise being used; “gaussian” will produce standard Gaussian noise

  • beta_start (float) – the value of the “beta” parameter at the beginning of the process (see DiffWave paper)

  • beta_end (float) – the value of the “beta” parameter at the end of the process

  • show_progress (bool) – whether to show progress during inference

Example

>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
training: bool
inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)[source]

Processes the inference for DiffWave. One inference function for all the locally/globally conditional generation and unconditional generation tasks.

Parameters:
  • unconditional (bool) – do unconditional generation if True, else do conditional generation

  • scale (int) – scale to get the final output wave length; for conditional generation, the output wave length is scale * condition.shape[-1] (for example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length), while for unconditional generation, scale should be the desired audio length

  • condition (torch.Tensor) – input spectrogram for vocoding or other conditions for other conditional generation, should be None for unconditional generation

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedules used for fast sampling

  • device (str|torch.device) – inference device

Returns:

predicted_sample – the predicted audio (bs, 1, t)

Return type:

torch.Tensor
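
For unconditional generation, scale is simply the desired number of output samples. A sketch assuming the wrapped DiffWave model was constructed with unconditional=True (the diffusion object in the conditional example above would not work as-is):

>>> waveform = diffusion.inference(
...     unconditional=True,
...     scale=22050,          # desired audio length in samples
... )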