speechbrain.nnet.autoencoders module

Autoencoder implementation. Can be used for Latent Diffusion or in isolation.

Authors
  • Artem Ploujnikov 2022

Summary

Classes:

  • Autoencoder – A standard interface for autoencoders

  • AutoencoderOutput

  • NormalizingAutoencoder – A classical (non-variational) autoencoder that does not use reparameterization but instead uses an ordinary normalization technique to constrain the latent space

  • VariationalAutoencoder – A Variational Autoencoder (VAE) implementation.

  • VariationalAutoencoderOutput

Reference

class speechbrain.nnet.autoencoders.Autoencoder(*args, **kwargs)[source]

Bases: Module

A standard interface for autoencoders

Example

>>> import torch
>>> from torch import nn
>>> from speechbrain.nnet.linear import Linear
>>> class SimpleAutoencoder(Autoencoder):
...    def __init__(self):
...        super().__init__()
...        self.enc = Linear(n_neurons=16, input_size=128)
...        self.dec = Linear(n_neurons=128, input_size=16)
...    def encode(self, x, length=None):
...        return self.enc(x)
...    def decode(self, x, length=None):
...        return self.dec(x)
>>> autoencoder = SimpleAutoencoder()
>>> x = torch.randn(4, 10, 128)
>>> x_enc = autoencoder.encode(x)
>>> x_enc.shape
torch.Size([4, 10, 16])
>>> x_enc_fw = autoencoder(x)
>>> x_enc_fw.shape
torch.Size([4, 10, 16])
>>> x_rec = autoencoder.decode(x_enc)
>>> x_rec.shape
torch.Size([4, 10, 128])
encode(x, length=None)[source]

Converts a sample from an original space (e.g. pixel or waveform) to a latent space

Parameters:
  • x (torch.Tensor) – the original data representation

  • length (torch.Tensor) – the length (optional)

Returns:

latent – the latent representation

Return type:

torch.Tensor

decode(latent)[source]

Decodes the sample from a latent representation

Parameters:

latent (torch.Tensor) – the latent representation

Returns:

result – the decoded sample

Return type:

torch.Tensor

forward(x)[source]

Performs the forward pass

Parameters:

x (torch.Tensor) – the original data representation

Returns:

result – the latent representation

Return type:

torch.Tensor

training: bool
class speechbrain.nnet.autoencoders.VariationalAutoencoder(encoder, decoder, mean, log_var, len_dim=1, latent_padding=None, mask_latent=True, mask_out=True, out_mask_value=0.0, latent_mask_value=0.0, latent_stochastic=True)[source]

Bases: Autoencoder

A Variational Autoencoder (VAE) implementation.

Paper reference: https://arxiv.org/abs/1312.6114

Parameters:
  • encoder (torch.nn.Module) – the encoder network

  • decoder (torch.nn.Module) – the decoder network

  • mean (torch.nn.Module) – the module that computes the mean

  • log_var (torch.nn.Module) – the module that computes the log variance

  • latent_padding (torch.nn.Module) – the padding to be applied to the latent representation (optional)

  • len_dim (int) – the length dimension

  • mask_latent (bool) – whether to apply the length mask to the latent representation

  • mask_out (bool) – whether to apply the length mask to the output

  • out_mask_value (float) – the mask value used for the output

  • latent_mask_value (float) – the mask value used for the latent representation

  • latent_stochastic (bool) – if true, the “latent” field of VariationalAutoencoderOutput will be the latent space sample; if false, it will be the mean

Example

The example below shows a very simple implementation of a VAE, not suitable for actual experiments:

>>> import torch
>>> from torch import nn
>>> from speechbrain.nnet.linear import Linear
>>> vae_enc = Linear(n_neurons=16, input_size=128)
>>> vae_dec = Linear(n_neurons=128, input_size=16)
>>> vae_mean = Linear(n_neurons=16, input_size=16)
>>> vae_log_var = Linear(n_neurons=16, input_size=16)
>>> vae = VariationalAutoencoder(
...     encoder=vae_enc,
...     decoder=vae_dec,
...     mean=vae_mean,
...     log_var=vae_log_var,
... )
>>> x = torch.randn(4, 10, 128)

.train_sample() encodes a single batch and then reconstructs it

>>> vae_out = vae.train_sample(x)
>>> vae_out.rec.shape
torch.Size([4, 10, 128])
>>> vae_out.latent.shape
torch.Size([4, 10, 16])
>>> vae_out.mean.shape
torch.Size([4, 10, 16])
>>> vae_out.log_var.shape
torch.Size([4, 10, 16])
>>> vae_out.latent_sample.shape
torch.Size([4, 10, 16])

.encode() will return the mean corresponding to the sample provided

>>> x_enc = vae.encode(x)
>>> x_enc.shape
torch.Size([4, 10, 16])

.reparameterize() performs the reparameterization trick

>>> x_enc = vae.encoder(x)
>>> mean = vae.mean(x_enc)
>>> log_var = vae.log_var(x_enc)
>>> x_repar = vae.reparameterize(mean, log_var)
>>> x_repar.shape
torch.Size([4, 10, 16])
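
The latent_stochastic flag controls what the “latent” field of the output contains. The sketch below is a minimal illustration based on the documented behavior, reusing the toy modules defined above; with the default settings (no length masking or padding), the latent field equals the mean when latent_stochastic=False.

>>> vae_det = VariationalAutoencoder(
...     encoder=vae_enc,
...     decoder=vae_dec,
...     mean=vae_mean,
...     log_var=vae_log_var,
...     latent_stochastic=False,
... )
>>> vae_det_out = vae_det.train_sample(x)
>>> torch.equal(vae_det_out.latent, vae_det_out.mean)
True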
encode(x, length=None)[source]

Converts a sample from an original space (e.g. pixel or waveform) to a latent space

Parameters:

x (torch.Tensor) – the original data representation

Returns:

latent – the latent representation

Return type:

torch.Tensor

decode(latent)[source]

Decodes the sample from a latent representation

Parameters:

latent (torch.Tensor) – the latent representation

Returns:

result – the decoded sample

Return type:

torch.Tensor

reparameterize(mean, log_var)[source]

Applies the VAE reparameterization trick to get a single latent space sample for decoding

Parameters:
  • mean (torch.Tensor) – the latent representation mean

  • log_var (torch.Tensor) – the logarithm of the latent representation variance

Returns:

sample – a latent space sample

Return type:

torch.Tensor
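
For reference, the reparameterization trick follows the standard formulation sample = mean + exp(0.5 * log_var) * eps, with eps drawn from a standard normal distribution. The snippet below is a stand-alone illustration of that computation, not the module's internal code.

>>> mean = torch.zeros(4, 10, 16)
>>> log_var = torch.zeros(4, 10, 16)
>>> eps = torch.randn_like(mean)
>>> sample = mean + eps * torch.exp(0.5 * log_var)
>>> sample.shape
torch.Size([4, 10, 16])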

train_sample(x, length=None, out_mask_value=None, latent_mask_value=None)[source]

Provides a data sample for training the autoencoder

Parameters:
  • x (torch.Tensor) – the source data (in the sample space)

  • length (torch.Tensor) – the length (optional). If provided, latents and outputs will be masked

  • out_mask_value (float) – the mask value used for the output (optional)

  • latent_mask_value (float) – the mask value used for the latent representation (optional)

Returns:

result – a named tuple with the following fields:

  • rec (torch.Tensor) – the reconstruction

  • latent (torch.Tensor) – the latent space sample

  • mean (torch.Tensor) – the mean of the latent representation

  • log_var (torch.Tensor) – the logarithm of the variance of the latent representation

Return type:

VariationalAutoencoderOutput
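
The fields returned by train_sample are typically combined into a training objective with a reconstruction term and a KL divergence term. The snippet below is a minimal sketch of one common formulation (mean squared error plus the standard Gaussian KL), reusing the vae and x from the example above; it is an illustration, not part of this module.

>>> vae_out = vae.train_sample(x)
>>> rec_loss = torch.nn.functional.mse_loss(vae_out.rec, x)
>>> kl_loss = -0.5 * torch.mean(
...     1 + vae_out.log_var - vae_out.mean ** 2 - vae_out.log_var.exp()
... )
>>> loss = rec_loss + kl_loss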

training: bool
class speechbrain.nnet.autoencoders.VariationalAutoencoderOutput(rec, latent, mean, log_var, latent_sample, latent_length)

Bases: tuple

latent

Alias for field number 1

latent_length

Alias for field number 5

latent_sample

Alias for field number 4

log_var

Alias for field number 3

mean

Alias for field number 2

rec

Alias for field number 0
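
Since this is a named tuple, the fields can also be unpacked positionally in field-number order; a small sketch assuming the vae_out produced in the VariationalAutoencoder example above:

>>> rec, latent, mean, log_var, latent_sample, latent_length = vae_out
>>> rec.shape
torch.Size([4, 10, 128])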

class speechbrain.nnet.autoencoders.AutoencoderOutput(rec, latent, latent_length)

Bases: tuple

latent

Alias for field number 1

latent_length

Alias for field number 2

rec

Alias for field number 0

class speechbrain.nnet.autoencoders.NormalizingAutoencoder(encoder, decoder, latent_padding=None, norm=None, len_dim=1, mask_out=True, mask_latent=True, out_mask_value=0.0, latent_mask_value=0.0)[source]

Bases: Autoencoder

A classical (non-variational) autoencoder that does not use reparameterization but instead uses an ordinary normalization technique to constrain the latent space

Parameters:
  • encoder (torch.nn.Module) – the encoder to be used

  • decoder (torch.nn.Module) – the decoder to be used

  • norm (torch.nn.Module) – the normalization module

  • mask_latent (bool) – whether to apply the length mask to the latent representation

  • mask_out (bool) – whether to apply the length mask to the output

  • out_mask_value (float) – the mask value used for the output

  • latent_mask_value (float) – the mask value used for the latent representation

  • len_dim (int) – the length dimension

Examples

>>> import torch
>>> from torch import nn
>>> from speechbrain.nnet.linear import Linear
>>> ae_enc = Linear(n_neurons=16, input_size=128)
>>> ae_dec = Linear(n_neurons=128, input_size=16)
>>> ae = NormalizingAutoencoder(
...     encoder=ae_enc,
...     decoder=ae_dec,
... )
>>> x = torch.randn(4, 10, 128)
>>> x_enc = ae.encode(x)
>>> x_enc.shape
torch.Size([4, 10, 16])
>>> x_dec = ae.decode(x_enc)
>>> x_dec.shape
torch.Size([4, 10, 128])
encode(x, length=None)[source]

Converts a sample from an original space (e.g. pixel or waveform) to a latent space

Parameters:

x (torch.Tensor) – the original data representation

Returns:

latent – the latent representation

Return type:

torch.Tensor

decode(latent)[source]

Decodes the sample from a latent representation

Parameters:

latent (torch.Tensor) – the latent representation

Returns:

result – the decoded sample

Return type:

torch.Tensor

train_sample(x, length=None, out_mask_value=None, latent_mask_value=None)[source]

Provides a data sample for training the autoencoder

Parameters:
  • x (torch.Tensor) – the source data (in the sample space)

  • length (torch.Tensor) – the length (optional). If provided, latents and outputs will be masked

  • out_mask_value (float) – the mask value used for the output (optional)

  • latent_mask_value (float) – the mask value used for the latent representation (optional)

Returns:

result – a named tuple with the following fields:

  • rec (torch.Tensor) – the reconstruction

  • latent (torch.Tensor) – the latent space sample

Return type:

AutoencoderOutput
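
Continuing the constructor example above, a short illustrative sketch of train_sample; the shapes follow from the 128-to-16 encoder and 16-to-128 decoder assumed there.

>>> ae_out = ae.train_sample(x)
>>> ae_out.rec.shape
torch.Size([4, 10, 128])
>>> ae_out.latent.shape
torch.Size([4, 10, 16])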

training: bool