speechbrain.nnet.diffusion module

An implementation of Denoising Diffusion

https://arxiv.org/pdf/2006.11239.pdf

Certain parts adopted from / inspired by denoising-diffusion-pytorch https://github.com/lucidrains/denoising-diffusion-pytorch

Authors

Artem Ploujnikov 2022

Summary

Classes:

`DenoisingDiffusion`	An implementation of a classic Denoising Diffusion Probabilistic Model (DDPM)
`Diffuser`	A base diffusion implementation
`DiffusionTrainSample`
`GaussianNoise`	Adds ordinary Gaussian noise
`LatentDiffusion`	A latent diffusion wrapper.
`LatentDiffusionTrainSample`
`LengthMaskedGaussianNoise`	Gaussian noise applied to padded samples.

Functions:

sample_timesteps

Returns a random sample of timesteps as a 1-D tensor (one dimension only)

Reference

class speechbrain.nnet.diffusion.Diffuser(model, timesteps, noise=None)[source]

Bases: Module

A base diffusion implementation

Parameters:

model (nn.Module) – the underlying model
timesteps (int) – the number of timesteps
noise (callable|str) – the noise function/module to use. The following predefined types of noise are provided:
“gaussian”: Gaussian noise, applied to the whole sample
“length_masked_gaussian”: Gaussian noise applied only to the parts of the sample that are not padding

distort(x, timesteps=None)[source]

Adds noise to a batch of data

Parameters:

x (torch.Tensor) – the original data sample
timesteps (torch.Tensor) – a 1-D integer tensor whose length equals the batch size of x, where each entry is the timestep number for the corresponding batch item. If omitted, timesteps will be randomly sampled

Returns:

result – a tensor of the same dimension as x

Return type:

torch.Tensor

train_sample(x, timesteps=None, condition=None, **kwargs)[source]

Creates a sample for the training loop with a corresponding target

Parameters:

x (torch.Tensor) – the original data sample
timesteps (torch.Tensor) – a 1-D integer tensor whose length equals the batch size of x, where each entry is the timestep number for the corresponding batch item. If omitted, timesteps will be randomly sampled
condition (torch.Tensor) – the condition used for conditional generation. Should be omitted during unconditional generation

Returns:

pred (torch.Tensor) – the model output: the predicted noise
noise (torch.Tensor) – the noise being applied
noisy_sample – the sample with the noise applied
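
As a rough illustration of how these three values are typically consumed, the sketch below computes the simplified DDPM objective (mean squared error between the predicted noise and the applied noise) and takes one optimizer step. The `toy_model` convolution and the unscaled `x + noise` distortion are stand-ins assumed for this example only; a real setup would obtain `pred`, `noise` and `noisy_sample` from `train_sample()`.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the denoising network (not part of this module)
toy_model = nn.Conv2d(1, 1, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)

x = torch.randn(4, 1, 16, 16)
noise = torch.randn_like(x)
# Simplified distortion for illustration; a real diffuser scales both
# terms according to the timestep
noisy_sample = x + noise
pred = toy_model(noisy_sample)

# Simplified DDPM objective: the model is trained to predict the noise
loss = F.mse_loss(pred, noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
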

sample(shape, **kwargs)[source]

Generates samples of the shape indicated by the shape parameter

Parameters:

shape (enumerable) – the shape of the sample to generate

Returns:

result – the generated sample(s)

Return type:

torch.Tensor

forward(x, timesteps=None)[source]

Computes the forward pass, calls distort()

training: bool

class speechbrain.nnet.diffusion.DenoisingDiffusion(model, timesteps=None, noise=None, beta_start=None, beta_end=None, sample_min=None, sample_max=None, show_progress=False)[source]

Bases: Diffuser

An implementation of a classic Denoising Diffusion Probabilistic Model (DDPM)

Parameters:

model (nn.Module) – the underlying model
timesteps (int) – the number of timesteps
noise (str|nn.Module) – the type of noise being used. “gaussian” will produce standard Gaussian noise
beta_start (float) – the value of the “beta” parameter at the beginning of the process (see the paper)
beta_end (float) – the value of the “beta” parameter at the end of the process
show_progress (bool) – whether to show progress during inference

Example

>>> import torch
>>> from speechbrain.nnet.diffusion import DenoisingDiffusion
>>> from speechbrain.nnet.unet import UNetModel
>>> unet = UNetModel(
...     in_channels=1,
...     model_channels=16,
...     norm_num_groups=4,
...     out_channels=1,
...     num_res_blocks=1,
...     attention_resolutions=[]
... )
>>> diff = DenoisingDiffusion(
...     model=unet,
...     timesteps=5
... )
>>> x = torch.randn(4, 1, 64, 64)
>>> pred, noise, noisy_sample = diff.train_sample(x)
>>> pred.shape
torch.Size([4, 1, 64, 64])
>>> noise.shape
torch.Size([4, 1, 64, 64])
>>> noisy_sample.shape
torch.Size([4, 1, 64, 64])
>>> sample = diff.sample((2, 1, 64, 64))
>>> sample.shape
torch.Size([2, 1, 64, 64])

compute_coefficients()[source]

Computes diffusion coefficients (alphas and betas)
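
To give a feel for what such coefficients look like, here is a minimal pure-Python sketch of the linear beta schedule described in the DDPM paper. The helper names and the default values (1e-4 and 0.02, the endpoints used in the paper) are assumptions for this sketch; the module's own defaults and internals may differ.

```python
import math


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical helper: betas spaced linearly from beta_start to beta_end."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """Returns (alphas, alphas_cumprod), where alpha_t = 1 - beta_t."""
    alphas = [1.0 - beta for beta in betas]
    alphas_cumprod = []
    running = 1.0
    for alpha in alphas:
        running *= alpha
        alphas_cumprod.append(running)
    return alphas, alphas_cumprod


betas = linear_beta_schedule(5)
alphas, alphas_cumprod = cumulative_alphas(betas)
# Weights of the clean signal and of the noise at each timestep
signal_coefficients = [math.sqrt(a) for a in alphas_cumprod]
noise_coefficients = [math.sqrt(1.0 - a) for a in alphas_cumprod]
```

The cumulative product shrinks with every timestep, so later timesteps carry less of the original signal and more noise.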

distort(x, noise=None, timesteps=None, **kwargs)[source]

Adds noise to the sample in a forward diffusion process

Parameters:

x (torch.Tensor) – a data sample of 2 or more dimensions, with the first dimension representing the batch
noise (torch.Tensor) – the noise to add
timesteps (torch.Tensor) – a 1-D integer tensor whose length equals the batch size of x, where each entry is the timestep number for the corresponding batch item. If omitted, timesteps will be randomly sampled

Returns:

result – a tensor of the same dimension as x

Return type:

torch.Tensor
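
The forward process in the DDPM paper has a closed form: the noisy sample at timestep t is sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * noise, where alpha_bar_t is the cumulative product of (1 - beta). The sketch below illustrates that formula; the helper name `distort_sketch`, the zero-based timestep indexing and the linear schedule are assumptions made for the example, not a description of this method's internals.

```python
import torch


def distort_sketch(x, noise, timesteps, alphas_cumprod):
    """Hypothetical helper: closed-form forward diffusion from the DDPM paper."""
    # One coefficient per batch item, reshaped to broadcast over the other dims
    broadcast_shape = (x.size(0),) + (1,) * (x.dim() - 1)
    alpha_bar = alphas_cumprod[timesteps].view(broadcast_shape)
    return alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * noise


# Assumed linear schedule with 5 timesteps
betas = torch.linspace(1e-4, 0.02, 5)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x = torch.randn(4, 1, 8, 8)
noise = torch.randn_like(x)
timesteps = torch.tensor([0, 1, 3, 4])  # one timestep per batch item
noisy = distort_sketch(x, noise, timesteps, alphas_cumprod)
```
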

sample(shape, **kwargs)[source]

Generates samples of the shape indicated by the shape parameter

Parameters:

shape (enumerable) – the shape of the sample to generate

Returns:

result – the generated sample(s)

Return type:

torch.Tensor

sample_step(sample, timestep, **kwargs)[source]

Processes a single timestep for the sampling process

Parameters:

sample (torch.Tensor) – the sample for the following timestep
timestep (int) – the timestep number

Returns:

predicted_sample (torch.Tensor) – the predicted sample (denoised by one step)
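
For orientation, one reverse step as written in Algorithm 2 of the DDPM paper can be sketched as follows. The function `ddpm_reverse_step`, its arguments and the choice of variance (sigma_t squared equal to beta_t) are assumptions for this illustration; the method documented here may compute the step differently, for instance by clipping values with sample_min and sample_max.

```python
import torch


def ddpm_reverse_step(sample, predicted_noise, timestep, betas, alphas_cumprod):
    """Hypothetical helper: one denoising step following the DDPM paper."""
    beta_t = betas[timestep]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = alphas_cumprod[timestep]
    # Posterior mean: remove the scaled noise estimate, then rescale
    mean = (sample - beta_t / (1.0 - alpha_bar_t).sqrt() * predicted_noise) / alpha_t.sqrt()
    if timestep == 0:
        # No noise is added on the final step
        return mean
    return mean + beta_t.sqrt() * torch.randn_like(sample)


betas = torch.linspace(1e-4, 0.02, 5)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

sample = torch.randn(2, 1, 8, 8)
# A zero tensor stands in for the model's noise prediction
predicted_noise = torch.zeros_like(sample)
for timestep in reversed(range(5)):
    sample = ddpm_reverse_step(sample, predicted_noise, timestep, betas, alphas_cumprod)
```
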

training: bool

class speechbrain.nnet.diffusion.LatentDiffusion(autoencoder, diffusion, latent_downsample_factor=None, latent_pad_dim=1)[source]

Bases: Module

A latent diffusion wrapper. Latent diffusion is denoising diffusion applied to a latent space instead of the original data space

Parameters:

autoencoder (speechbrain.nnet.autoencoders.Autoencoder) – An autoencoder converting the original space to a latent space
diffusion (speechbrain.nnet.diffusion.Diffuser) – A diffusion wrapper
latent_downsample_factor (int) – The factor that latent space dimensions need to be divisible by. This is useful if the underlying model for the diffusion wrapper is based on a UNet-like architecture where the inputs are progressively downsampled and upsampled by factors of two
latent_pad_dim (int|list[int]) – the dimension(s) along which the latent space will be padded
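
The divisibility requirement comes down to simple arithmetic: a dimension is extended to the next multiple of the factor. The helper below is a hypothetical illustration of that calculation only; how the wrapper actually pads (which side, which fill value) is not specified here.

```python
def padded_length(length, factor):
    """Hypothetical helper: smallest multiple of `factor` that is >= `length`."""
    return ((length + factor - 1) // factor) * factor


# A latent dimension of 61 needs 3 extra positions to be divisible by 4
needed_padding = padded_length(61, 4) - 61
# A dimension that is already divisible is left unchanged
unchanged = padded_length(64, 4)
```
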

Example

>>> import torch
>>> from torch import nn
>>> from speechbrain.nnet.CNN import Conv2d
>>> from speechbrain.nnet.autoencoders import NormalizingAutoencoder
>>> from speechbrain.nnet.diffusion import DenoisingDiffusion, LatentDiffusion
>>> from speechbrain.nnet.unet import UNetModel

Set up a simple autoencoder (a real autoencoder would be a deep neural network)

>>> ae_enc = Conv2d(
...     kernel_size=3,
...     stride=4,
...     in_channels=1,
...     out_channels=1,
...     skip_transpose=True,
... )
>>> ae_dec = nn.ConvTranspose2d(
...     kernel_size=3,
...     stride=4,
...     in_channels=1,
...     out_channels=1,
...     output_padding=1
... )
>>> ae = NormalizingAutoencoder(
...     encoder=ae_enc,
...     decoder=ae_dec,
... )

Construct a diffusion model with a UNet architecture

>>> unet = UNetModel(
...     in_channels=1,
...     model_channels=16,
...     norm_num_groups=4,
...     out_channels=1,
...     num_res_blocks=1,
...     attention_resolutions=[]
... )
>>> diff = DenoisingDiffusion(
...     model=unet,
...     timesteps=5
... )
>>> latent_diff = LatentDiffusion(
...     autoencoder=ae,
...     diffusion=diff,
...     latent_downsample_factor=4,
...     latent_pad_dim=2
... )
>>> x = torch.randn(4, 1, 64, 64)
>>> latent_sample = latent_diff.train_sample_latent(x)
>>> diff_sample, ae_sample = latent_sample
>>> pred, noise, noisy_sample = diff_sample
>>> pred.shape
torch.Size([4, 1, 16, 16])
>>> noise.shape
torch.Size([4, 1, 16, 16])
>>> noisy_sample.shape
torch.Size([4, 1, 16, 16])
>>> ae_sample.latent.shape
torch.Size([4, 1, 16, 16])

Create a few samples (the shape given should be the shape of the latent space)

>>> sample = latent_diff.sample((2, 1, 16, 16))
>>> sample.shape
torch.Size([2, 1, 64, 64])

train_sample(x, **kwargs)[source]

Creates a sample for the training loop with a corresponding target

Parameters:

x (torch.Tensor) – the original data sample
timesteps (torch.Tensor) – a 1-D integer tensor whose length equals the batch size of x, where each entry is the timestep number for the corresponding batch item. If omitted, timesteps will be randomly sampled

Returns:

pred (torch.Tensor) – the model output: the predicted noise
noise (torch.Tensor) – the noise being applied
noisy_sample – the sample with the noise applied

train_sample_latent(x, **kwargs)[source]

Returns a train sample with autoencoder output. This can be used to jointly train the diffusion model and the autoencoder

Parameters:

x (torch.Tensor) – the original data sample

distort(x)[source]

Adds noise to the sample in a forward diffusion process

Parameters:

x (torch.Tensor) – a data sample of 2 or more dimensions, with the first dimension representing the batch
noise (torch.Tensor) – the noise to add
timesteps (torch.Tensor) – a 1-D integer tensor whose length equals the batch size of x, where each entry is the timestep number for the corresponding batch item. If omitted, timesteps will be randomly sampled

Returns:

result – a tensor of the same dimension as x

Return type:

torch.Tensor

sample(shape)[source]

Obtains a sample out of the diffusion model

Parameters:

shape (torch.Tensor) – the shape of the sample to generate, given as the shape of the latent space

Returns:

sample – the sample of the specified shape

Return type:

torch.Tensor

training: bool

speechbrain.nnet.diffusion.sample_timesteps(x, num_timesteps)[source]

Returns a random sample of timesteps as a 1-D tensor (one dimension only)

Parameters:

x (torch.Tensor) – a tensor of samples of any dimension
num_timesteps (int) – the total number of timesteps
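
One plausible way to draw such timesteps is a single call to `torch.randint`, producing one integer per batch item. The sketch below assumes zero-based timesteps and uses a hypothetical function name; it is not necessarily how this function is implemented.

```python
import torch


def sample_timesteps_sketch(x, num_timesteps):
    """Hypothetical helper: one random timestep in [0, num_timesteps) per batch item."""
    return torch.randint(0, num_timesteps, (x.size(0),), device=x.device)


x = torch.randn(4, 1, 64, 64)
timesteps = sample_timesteps_sketch(x, num_timesteps=5)
```
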

class speechbrain.nnet.diffusion.GaussianNoise(*args, **kwargs)[source]

Bases: Module

Adds ordinary Gaussian noise

forward(sample, **kwargs)[source]

Forward pass

Parameters:

sample (torch.Tensor) – the original sample

training: bool

class speechbrain.nnet.diffusion.LengthMaskedGaussianNoise(length_dim=1)[source]

Bases: Module

Gaussian noise applied to padded samples. No noise is added to positions that are part of padding

Parameters:: length_dim (int) – the

forward(sample, length=None, **kwargs)[source]

Creates Gaussian noise. If a tensor of lengths is provided, no noise is added to the padding positions.

Parameters:

sample (torch.Tensor) – a batch of data
length (torch.Tensor) – relative lengths
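
The masking idea can be sketched as follows, assuming that relative lengths are fractions of the longest item in the batch (so 1.0 means no padding). The function name and the rounding rule are assumptions for this illustration and may not match the module's behaviour exactly.

```python
import torch


def length_masked_noise(sample, length=None, length_dim=1):
    """Hypothetical helper: Gaussian noise that is zero at padding positions."""
    noise = torch.randn_like(sample)
    if length is None:
        return noise
    max_len = sample.size(length_dim)
    # Convert relative lengths to absolute lengths (rounding rule is assumed)
    abs_length = (length * max_len).round().long()
    positions = torch.arange(max_len, device=sample.device)
    mask = positions.unsqueeze(0) < abs_length.unsqueeze(1)  # (batch, max_len)
    # Reshape the mask so that it broadcasts against the sample
    mask_shape = [1] * sample.dim()
    mask_shape[0] = sample.size(0)
    mask_shape[length_dim] = max_len
    return noise * mask.view(mask_shape).to(noise.dtype)


sample = torch.zeros(2, 10, 3)
length = torch.tensor([1.0, 0.5])  # the second item is half padding
noise = length_masked_noise(sample, length)
```
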

training: bool

class speechbrain.nnet.diffusion.DiffusionTrainSample(pred, noise, noisy_sample)

Bases: tuple

noise: Alias for field number 1

noisy_sample: Alias for field number 2

pred: Alias for field number 0

class speechbrain.nnet.diffusion.LatentDiffusionTrainSample(diffusion, autoencoder)

Bases: tuple

autoencoder: Alias for field number 1

diffusion: Alias for field number 0