speechbrain.lobes.models.DiffWave module
Neural network modules for DiffWave: A Versatile Diffusion Model for Audio Synthesis
For more details: https://arxiv.org/pdf/2009.09761.pdf
- Authors
Yingzhi WANG 2022
Summary
Classes:
DiffWave | DiffWave model with dilated residual blocks |
DiffWaveDiffusion | An enhanced diffusion implementation with DiffWave-specific inference |
DiffusionEmbedding | Embeds the diffusion step into an input vector of DiffWave |
ResidualBlock | Residual block with dilated convolution |
SpectrogramUpsampler | Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels is in the residual block. |
Functions:
diffwave_mel_spectogram | Calculates the mel spectrogram of a raw audio signal and preprocesses it for DiffWave training |
Reference
- speechbrain.lobes.models.DiffWave.diffwave_mel_spectogram(sample_rate, hop_length, win_length, n_fft, n_mels, f_min, f_max, power, normalized, norm, mel_scale, audio)
Calculates the mel spectrogram of a raw audio signal and preprocesses it for DiffWave training.
- Parameters:
sample_rate (int) – Sample rate of the audio signal.
hop_length (int) – Length of the hop between STFT windows.
win_length (int) – Window size.
n_fft (int) – Size of the FFT.
n_mels (int) – Number of mel filterbanks.
f_min (float) – Minimum frequency.
f_max (float) – Maximum frequency.
power (float) – Exponent for the magnitude spectrogram.
normalized (bool) – Whether to normalize by magnitude after the STFT.
norm (str or None) – If "slaney", divide the triangular mel weights by the width of the mel band (area normalization).
mel_scale (str) – Scale to use: "htk" or "slaney".
audio (torch.Tensor) – Input audio signal.
- Returns:
mel
- Return type:
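The mel_scale parameter selects between two frequency-to-mel mappings. The sketch below writes out the conventional "htk" and "slaney" formulas as they are commonly defined in audio libraries; the helper names are hypothetical, and it is an assumption that diffwave_mel_spectogram follows exactly these definitions internally.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK-style mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_mel_slaney(freq_hz: float) -> float:
    """Slaney-style mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0               # Hz per mel in the linear region
    min_log_hz = 1000.0              # where the logarithmic region starts
    min_log_mel = min_log_hz / f_sp  # mel value at the 1 kHz boundary
    logstep = math.log(6.4) / 27.0   # log-region step size
    if freq_hz < min_log_hz:
        return freq_hz / f_sp
    return min_log_mel + math.log(freq_hz / min_log_hz) / logstep


# Compare the two scales at a few frequencies (hypothetical helper names).
for f in (0.0, 700.0, 1000.0, 8000.0):
    print(f, round(hz_to_mel_htk(f), 2), round(hz_to_mel_slaney(f), 2))
```

The two scales use different units, so mel values are only comparable within a single scale; what matters for the filterbank is how each scale spaces the band centers between f_min and f_max.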
- class speechbrain.lobes.models.DiffWave.DiffusionEmbedding(max_steps)
Bases: Module
Embeds the diffusion step into an input vector of DiffWave.
- Parameters:
max_steps (int) – total number of diffusion steps
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffusionEmbedding
>>> diffusion_embedding = DiffusionEmbedding(max_steps=50)
>>> time_step = torch.randint(50, (1,))
>>> step_embedding = diffusion_embedding(time_step)
>>> step_embedding.shape
torch.Size([1, 512])
- forward(diffusion_step)
Forward function of the diffusion step embedding.
- Parameters:
diffusion_step (torch.Tensor) – which step of diffusion to execute
- Returns:
diffusion step embedding
- Return type:
tensor [bs, 512]
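For intuition about what a step embedding looks like, the DiffWave paper describes a 128-dimensional sinusoidal encoding of the step index that is then passed through learned fully connected layers. The sketch below reproduces only that sinusoidal part in plain Python; the function name is hypothetical, and it is an assumption that this module follows the paper's recipe. The learned projection to the documented 512 dimensions is not shown.

```python
import math
from typing import List


def sinusoidal_step_encoding(step: int, half_dim: int = 64) -> List[float]:
    """Sinusoidal encoding of a diffusion step, following the DiffWave paper.

    Frequencies are 10 ** (j * 4 / (half_dim - 1)) for j = 0 .. half_dim - 1.
    The result concatenates all sines followed by all cosines.
    """
    freqs = [10.0 ** (j * 4.0 / (half_dim - 1)) for j in range(half_dim)]
    sines = [math.sin(step * f) for f in freqs]
    cosines = [math.cos(step * f) for f in freqs]
    return sines + cosines


encoding = sinusoidal_step_encoding(step=0)
print(len(encoding))  # 2 * half_dim values
```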
- class speechbrain.lobes.models.DiffWave.SpectrogramUpsampler
Bases: Module
Upsampler for spectrograms with transposed convolutions. Only the upsampling is done here; the layer-specific convolution that maps the mel bands into 2× residual channels is in the residual block.
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import SpectrogramUpsampler
>>> spec_upsampler = SpectrogramUpsampler()
>>> mel_input = torch.rand(3, 80, 100)
>>> upsampled_mel = spec_upsampler(mel_input)
>>> upsampled_mel.shape
torch.Size([3, 80, 25600])
- forward(x)
Upsamples spectrograms 256 times to match the length of the audio. The hop length should be 256 when extracting mel spectrograms.
- Parameters:
x (torch.Tensor) – input mel spectrogram [bs, 80, mel_len]
- Return type:
upsampled spectrogram [bs, 80, mel_len*256]
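Because the upsampling factor is fixed at 256, a mel spectrogram extracted with a different hop length will not line up with the audio. The sketch below is a small pre-flight check under that stated constraint; the function name and the UPSAMPLE_FACTOR constant are hypothetical and not part of the library.

```python
UPSAMPLE_FACTOR = 256  # fixed upsampling factor documented for SpectrogramUpsampler


def upsampled_length(mel_len: int, hop_length: int) -> int:
    """Return the upsampled time length, rejecting mismatched hop lengths."""
    if hop_length != UPSAMPLE_FACTOR:
        raise ValueError(
            f"hop_length={hop_length} does not match the upsampling "
            f"factor {UPSAMPLE_FACTOR}; mel frames would not align with audio"
        )
    return mel_len * UPSAMPLE_FACTOR


# 100 mel frames with hop length 256, as in the example above.
print(upsampled_length(mel_len=100, hop_length=256))
```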
- class speechbrain.lobes.models.DiffWave.ResidualBlock(n_mels, residual_channels, dilation, uncond=False)
Bases: Module
Residual block with dilated convolution.
- Parameters:
n_mels (int) – input mel channels of the conditioner convolution for conditional vocoding
residual_channels (int) – channels of the audio convolution
dilation (int) – dilation of the audio convolution
uncond (bool) – conditional/unconditional generation (default: False)
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import ResidualBlock
>>> res_block = ResidualBlock(n_mels=80, residual_channels=64, dilation=3)
>>> noisy_audio = torch.randn(1, 1, 22050)
>>> timestep_embedding = torch.rand(1, 512)
>>> upsampled_mel = torch.rand(1, 80, 22050)
>>> output = res_block(noisy_audio, timestep_embedding, upsampled_mel)
>>> output[0].shape
torch.Size([1, 64, 22050])
- forward(x, diffusion_step, conditioner=None)
Forward function of the residual block.
- Parameters:
x (torch.Tensor) – input sample [bs, 1, time]
diffusion_step (torch.Tensor) – the embedding of the diffusion step to execute
conditioner (torch.Tensor) – the condition used for conditional generation
- Returns:
residual output [bs, residual_channels, time]
a skip of residual branch [bs, residual_channels, time]
- class speechbrain.lobes.models.DiffWave.DiffWave(input_channels, residual_layers, residual_channels, dilation_cycle_length, total_steps, unconditional=False)
Bases: Module
DiffWave model with dilated residual blocks.
- Parameters:
input_channels (int) – input mel channels of conv1x1 for the conditional vocoding task
residual_layers (int) – number of residual blocks
residual_channels (int) – channels of the audio convolution
dilation_cycle_length (int) – dilation cycles of the audio convolution
total_steps (int) – total number of diffusion steps
unconditional (bool) – conditional/unconditional generation
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> noisy_audio = torch.randn(1, 1, 25600)
>>> timestep = torch.randint(50, (1,))
>>> input_mel = torch.rand(1, 80, 100)
>>> predicted_noise = diffwave(noisy_audio, timestep, input_mel)
>>> predicted_noise.shape
torch.Size([1, 1, 25600])
- forward(audio, diffusion_step, spectrogram=None, length=None)
DiffWave forward function.
- Parameters:
audio (torch.Tensor) – input Gaussian sample [bs, 1, time]
diffusion_step (torch.Tensor) – which timestep of diffusion to execute [bs, 1]
spectrogram (torch.Tensor) – spectrogram data [bs, 80, mel_len]
length (torch.Tensor) – sample lengths; not used, provided for compatibility only
- Return type:
predicted noise [bs, 1, time]
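To see how residual_layers and dilation_cycle_length interact, the sketch below lists the per-layer dilations under the scheme described in the DiffWave paper, where the dilation doubles at each layer and resets every dilation_cycle_length layers. The function names are hypothetical, and both the doubling scheme and the kernel size of 3 used for the receptive-field estimate are assumptions taken from the paper, not read from this module.

```python
from typing import List


def dilation_schedule(residual_layers: int, dilation_cycle_length: int) -> List[int]:
    """Dilation per residual layer: 1, 2, 4, ... restarting every cycle."""
    return [2 ** (i % dilation_cycle_length) for i in range(residual_layers)]


def receptive_field(dilations: List[int], kernel_size: int = 3) -> int:
    """Receptive field in samples of a stack of dilated 1-D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Same configuration as the example above: 30 layers in cycles of 10.
dilations = dilation_schedule(residual_layers=30, dilation_cycle_length=10)
print(dilations[:10])
print(receptive_field(dilations))
```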
- class speechbrain.lobes.models.DiffWave.DiffWaveDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)
Bases: DenoisingDiffusion
An enhanced diffusion implementation with DiffWave-specific inference.
- Parameters:
model (nn.Module) – the underlying model
timesteps (int) – the total number of timesteps
noise (str|nn.Module) – the type of noise being used; "gaussian" will produce standard Gaussian noise
beta_start (float) – the value of the "beta" parameter at the beginning of the process (see the DiffWave paper)
beta_end (float) – the value of the "beta" parameter at the end of the process
sample_min (float)
sample_max (float) – Used to clip the output.
show_progress (bool) – whether to show progress during inference
Example
>>> import torch
>>> from speechbrain.lobes.models.DiffWave import DiffWave
>>> diffwave = DiffWave(
...     input_channels=80,
...     residual_layers=30,
...     residual_channels=64,
...     dilation_cycle_length=10,
...     total_steps=50,
... )
>>> from speechbrain.lobes.models.DiffWave import DiffWaveDiffusion
>>> from speechbrain.nnet.diffusion import GaussianNoise
>>> diffusion = DiffWaveDiffusion(
...     model=diffwave,
...     beta_start=0.0001,
...     beta_end=0.05,
...     timesteps=50,
...     noise=GaussianNoise,
... )
>>> input_mel = torch.rand(1, 80, 100)
>>> output = diffusion.inference(
...     unconditional=False,
...     scale=256,
...     condition=input_mel,
...     fast_sampling=True,
...     fast_sampling_noise_schedule=[0.0001, 0.001, 0.01, 0.05, 0.2, 0.5],
... )
>>> output.shape
torch.Size([1, 25600])
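The beta_start, beta_end and timesteps arguments together define a noise schedule. The sketch below assumes the linearly spaced schedule described in the DiffWave paper and derives the cumulative signal-retention terms from it; the function names are hypothetical, and how DiffWaveDiffusion builds its schedule internally is an assumption here.

```python
from typing import List


def linear_beta_schedule(beta_start: float, beta_end: float, timesteps: int) -> List[float]:
    """Linearly spaced betas from beta_start to beta_end (inclusive)."""
    if timesteps < 1:
        raise ValueError("timesteps must be at least 1")
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t): the fraction of signal left at step t."""
    products = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        products.append(running)
    return products


# Same values as the example above.
betas = linear_beta_schedule(beta_start=0.0001, beta_end=0.05, timesteps=50)
alpha_bars = cumulative_alphas(betas)
print(betas[0], betas[-1], alpha_bars[-1])
```

A larger beta_end or more timesteps drives the final cumulative term closer to zero, meaning the last step is closer to pure noise.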
- inference(unconditional, scale, condition=None, fast_sampling=False, fast_sampling_noise_schedule=None, device=None)
Runs DiffWave inference. A single inference function covers locally/globally conditional generation and unconditional generation.
- Parameters:
unconditional (bool) – do unconditional generation if True, else do conditional generation
scale (int) – scale used to obtain the final output waveform length. For conditional generation, the output length is scale * condition.shape[-1]; for example, if the condition is a spectrogram of shape (bs, n_mel, time), scale should be the hop length. For unconditional generation, scale should be the desired audio length.
condition (torch.Tensor) – input spectrogram for vocoding, or another condition for other conditional generation tasks; should be None for unconditional generation
fast_sampling (bool) – whether to do fast sampling
fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling
device (str|torch.device) – inference device
- Returns:
predicted_sample – the predicted audio (bs, 1, t)
- Return type:
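The rule for scale described above can be written as a small helper that computes the expected output waveform length before running inference. This is a sketch of the documented arithmetic only; the function name is hypothetical and not part of the library.

```python
from typing import Optional, Sequence


def expected_output_length(
    unconditional: bool,
    scale: int,
    condition_shape: Optional[Sequence[int]] = None,
) -> int:
    """Expected number of audio samples produced by inference.

    Unconditional: scale is the desired audio length.
    Conditional: scale multiplies the last dimension of the condition.
    """
    if unconditional:
        return scale
    if condition_shape is None:
        raise ValueError("conditional generation needs the condition's shape")
    return scale * condition_shape[-1]


# Vocoding from a (bs, n_mel, time) spectrogram with hop length 256.
print(expected_output_length(False, 256, (1, 80, 100)))
# Unconditional generation of 16000 samples.
print(expected_output_length(True, 16000))
```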