speechbrain.lobes.models.DiffWave module
Neural network modules for DIFFWAVE: A VERSATILE DIFFUSION MODEL FOR AUDIO SYNTHESIS
For more details: https://arxiv.org/pdf/2009.09761.pdf
- Authors
Yingzhi WANG 2022
Summary
Classes:
DiffWave – DiffWave model with dilated residual blocks
DiffWaveDiffusion – An enhanced diffusion implementation with DiffWave-specific inference
DiffusionEmbedding – Embeds the diffusion step into an input vector of DiffWave
ResidualBlock – Residual block with dilated convolution
SpectrogramUpsampler – Upsampler for spectrograms using transposed convolutions. Only the upsampling is done here; the layer-specific conv that maps the mel bands into 2× residual channels is in the residual block
Functions:
diffwave_mel_spectogram – Calculates a MelSpectrogram for a raw audio signal and preprocesses it for DiffWave training
Reference
- speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)[source]
Calculates a MelSpectrogram for a raw audio signal and preprocesses it for DiffWave training.
- Parameters:
sample_rate (int) – Sample rate of audio signal.
hop_length (int) – Length of hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after stft.
norm (str or None) – If “slaney”, divide the triangular mel weights by the width of the mel band
mel_scale (str) – Scale to use: “htk” or “slaney”.
audio (torch.Tensor) – input audio signal
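To make the roles of n_mels, f_min, f_max and mel_scale more concrete, the sketch below computes where the mel filter centers would fall on the “htk” scale. It is a framework-free illustration only: the helper names are hypothetical, the values 80 / 0.0 / 8000.0 are assumed for the example, and it does not reproduce what diffwave_mel_spectogram does internally.

```python
import math
from typing import List


def hz_to_mel_htk(freq_hz: float) -> float:
    """Convert Hz to mel using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel axis."""
    mel_min = hz_to_mel_htk(f_min)
    mel_max = hz_to_mel_htk(f_max)
    # n_mels + 2 evenly spaced mel points: two outer edges plus n_mels centers.
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz_htk(mel_min + step * (i + 1)) for i in range(n_mels)]


# Assumed example values, not defaults taken from the library.
centers = mel_band_centers(n_mels=80, f_min=0.0, f_max=8000.0)
```

The centers are dense at low frequencies and sparse at high frequencies, which is the point of the mel scale.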
- class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)[source]
Bases:
Module
Embeds the diffusion step into an input vector of DiffWave
- Parameters:
max_steps (int) – total diffusion steps
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
- forward(diffusion_step)[source]
forward function of diffusion step embedding
- Parameters:
diffusion_step (torch.Tensor) – which step of diffusion to execute
- Returns:
diffusion step embedding
- Return type:
tensor [bs, 512]
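The DiffWave paper encodes the diffusion step t as a 128-dimensional sinusoidal vector before fully connected layers project it to the embedding size. The sketch below shows only that sinusoidal encoding in plain Python. The function name is hypothetical, and whether DiffusionEmbedding builds its table in exactly this way is an assumption based on the paper, not on the source code.

```python
import math
from typing import List


def sinusoidal_step_encoding(step: int, half_dim: int = 64) -> List[float]:
    """Return [sin(f_0 t), ..., sin(f_63 t), cos(f_0 t), ..., cos(f_63 t)].

    The frequencies follow the paper: f_j = 10 ** (j * 4 / 63) for j = 0..63.
    """
    freqs = [10.0 ** (j * 4.0 / (half_dim - 1)) for j in range(half_dim)]
    sines = [math.sin(step * f) for f in freqs]
    cosines = [math.cos(step * f) for f in freqs]
    return sines + cosines


# Encode an arbitrary step; the learned projection to 512 dims is omitted here.
encoding = sinusoidal_step_encoding(step=7)
```

In the full module this vector would then pass through learned layers to reach the [bs, 512] shape documented above.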
- class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler[source]
Bases:
Module
Upsampler for spectrograms using transposed convolutions. Only the upsampling is done here; the layer-specific conv that maps the mel bands into 2× residual channels is in the residual block.
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
- forward(x)[source]
Upsamples spectrograms by a factor of 256 to match the length of the audio. The hop length should be 256 when extracting mel spectrograms.
- Parameters:
x (torch.Tensor) – input mel spectrogram [bs, 80, mel_len]
- Returns:
upsampled spectrogram [bs, 80, mel_len*256]
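The factor of 256 can be checked with the standard output-length formula for transposed convolutions. The sketch below assumes two stacked stages with kernel size 32, stride 16 and padding 8 along the time axis, as in the reference DiffWave implementation. That SpectrogramUpsampler uses these exact hyperparameters is an assumption, and the function names are hypothetical.

```python
def conv_transpose_out_len(
    length_in: int,
    kernel_size: int,
    stride: int,
    padding: int,
    dilation: int = 1,
    output_padding: int = 0,
) -> int:
    """Output length of a transposed convolution along one axis."""
    return (
        (length_in - 1) * stride
        - 2 * padding
        + dilation * (kernel_size - 1)
        + output_padding
        + 1
    )


def upsampled_len(
    mel_len: int,
    n_stages: int = 2,
    kernel_size: int = 32,
    stride: int = 16,
    padding: int = 8,
) -> int:
    """Apply the assumed transposed-conv stages one after another."""
    length = mel_len
    for _ in range(n_stages):
        length = conv_transpose_out_len(length, kernel_size, stride, padding)
    return length


print(upsampled_len(100))  # 25600
```

With these settings each stage multiplies the length by exactly 16, so two stages give 16 × 16 = 256, matching a hop length of 256.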
- class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)[source]
Bases:
Module
Residual Block with dilated convolution
- Parameters:
n_mels – input mel channels of conv1x1 for conditional vocoding task
residual_channels – channels of audio convolution
dilation – dilation cycles of audio convolution
uncond – conditional/unconditional generation
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
- forward(x, diffusion_step, conditioner=None)[source]
forward function of Residual Block
- Parameters:
x (torch.Tensor) – input sample [bs, 1, time]
diffusion_step (torch.Tensor) – the embedding of which step of diffusion to execute
conditioner (torch.Tensor) – the condition used for conditional generation
- Returns:
residual output [bs, residual_channels, time]
a skip of residual branch [bs, residual_channels, time]
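The two return values reflect the WaveNet-style gated unit described in the DiffWave paper. The sketch below illustrates that data flow on plain lists of floats. It is heavily simplified: the convolutions and the output projection are omitted, the 1/sqrt(2) scaling is taken from the reference DiffWave implementation, and the function names are hypothetical. It should not be read as the actual ResidualBlock code.

```python
import math
from typing import List, Tuple


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def gated_residual_step(
    x: List[float], gate: List[float], filt: List[float]
) -> Tuple[List[float], List[float]]:
    """Combine gate/filter activations and form the residual and skip outputs.

    `gate` and `filt` stand in for the two halves of the dilated-conv output.
    """
    # Gated activation: sigmoid(gate) * tanh(filter), element by element.
    z = [sigmoid(g) * math.tanh(f) for g, f in zip(gate, filt)]
    # Residual path: add the block input back and rescale (assumed scaling).
    residual = [(xi + zi) / math.sqrt(2.0) for xi, zi in zip(x, z)]
    # Skip path: passed on unchanged in this simplified sketch.
    return residual, z


residual_out, skip_out = gated_residual_step([1.0, -0.5], [0.0, 2.0], [0.0, 1.0])
```

The residual output feeds the next block, while the skip outputs of all blocks are accumulated to form the final prediction.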
- class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)[source]
Bases:
Module
DiffWave Model with dilated residual blocks
- Parameters:
input_channels – input mel channels of conv1x1 for conditional vocoding task
residual_layers – number of residual blocks
residual_channels – channels of audio convolution
dilation_cycle_length – dilation cycles of audio convolution
total_steps – total steps of diffusion
unconditional – conditional/unconditional generation
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
- forward(audio, diffusion_step, spectrogram=None, length=None)[source]
DiffWave forward function
- Parameters:
audio (torch.Tensor) – input gaussian sample [bs, 1, time]
diffusion_step (torch.Tensor) – which timestep of diffusion to execute [bs, 1]
spectrogram (torch.Tensor) – spectrogram data [bs, 80, mel_len]
length (torch.Tensor) – sample lengths - not used - provided for compatibility only
- Returns:
predicted noise [bs, 1, time]
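The interaction between residual_layers and dilation_cycle_length can be sketched as follows. In the DiffWave paper the dilation doubles at each layer within a cycle and then resets. The sketch assumes that scheme and a kernel size of 3, as in the paper. The function names are hypothetical, and the exact way the DiffWave class assigns dilations is not confirmed here.

```python
from typing import List


def dilation_schedule(residual_layers: int, dilation_cycle_length: int) -> List[int]:
    """Dilation per layer: 1, 2, 4, ... restarting every cycle."""
    return [2 ** (i % dilation_cycle_length) for i in range(residual_layers)]


def receptive_field(dilations: List[int], kernel_size: int = 3) -> int:
    """Receptive field (in samples) of a stack of dilated convolutions."""
    return (kernel_size - 1) * sum(dilations) + 1


dilations = dilation_schedule(residual_layers=30, dilation_cycle_length=10)
print(dilations[:12])  # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1, 2]
print(receptive_field(dilations))  # 6139
```

With the settings from the example above (30 layers, cycle length 10), each output sample sees 6139 input samples under these assumptions.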
- class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]
Bases:
DenoisingDiffusion
An enhanced diffusion implementation with DiffWave-specific inference
- Parameters:
model (nn.Module) – the underlying model
timesteps (int) – the total number of timesteps
noise (str|nn.Module) – the type of noise being used “gaussian” will produce standard Gaussian noise
beta_start (float) – the value of the “beta” parameter at the beginning of the process (see DiffWave paper)
beta_end (float) – the value of the “beta” parameter at the end of the process
show_progress (bool) – whether to show progress during inference
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
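To illustrate what beta_start, beta_end and timesteps control, the sketch below builds a linearly spaced beta schedule and the cumulative products of (1 − beta) that determine how much signal survives at each step. Linear spacing is the choice described in the DiffWave paper. That DiffWaveDiffusion computes its schedule in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import List


def linear_beta_schedule(beta_start: float, beta_end: float, timesteps: int) -> List[float]:
    """Evenly spaced noise variances from beta_start to beta_end (inclusive)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + step * t for t in range(timesteps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t): the fraction of signal variance kept."""
    result = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        result.append(product)
    return result


# Same values as in the example above.
betas = linear_beta_schedule(beta_start=0.0001, beta_end=0.05, timesteps=50)
alpha_bars = cumulative_alphas(betas)
```

The cumulative product shrinks monotonically, so later steps contain progressively less of the original signal and more noise.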
- inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)[source]
Runs inference for DiffWave. A single inference function covers locally/globally conditional generation and unconditional generation.
- Parameters:
unconditional (bool) – do unconditional generation if True, else do conditional generation
scale (int) – scale used to get the final output wave length. For conditional generation, the output wave length is scale * condition.shape[-1]; for example, if the condition is a spectrogram (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length
condition (torch.Tensor) – input spectrogram for vocoding, or another condition for other conditional generation tasks; should be None for unconditional generation
fast_sampling (bool) – whether to do fast sampling
fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling
device (str|torch.device) – inference device
- Returns:
predicted_sample – the predicted audio (bs, 1, t)
- Return type:
torch.Tensor
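The relationship between scale and the output length stated above can be written as a small helper. This is only the arithmetic from the parameter description. The function name and its condition_frames argument are hypothetical and not part of the library.

```python
from typing import Optional


def expected_output_length(
    unconditional: bool, scale: int, condition_frames: Optional[int] = None
) -> int:
    """Number of audio samples inference should produce.

    condition_frames plays the role of condition.shape[-1].
    """
    if unconditional:
        # scale is the desired audio length itself.
        return scale
    if condition_frames is None:
        raise ValueError("conditional generation needs the condition length")
    # scale is the hop length; each frame expands to `scale` samples.
    return scale * condition_frames


print(expected_output_length(False, scale=256, condition_frames=100))  # 25600
print(expected_output_length(True, scale=16000))  # 16000
```

The first call matches the example above: a 100-frame mel spectrogram with a hop length of 256 yields 25600 samples.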