speechbrain.lobes.models.discrete.dac module

This lobe enables the integration of a pretrained discrete DAC model.

Reference: http://arxiv.org/abs/2306.06546

Reference: https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5

Reference: https://github.com/descriptinc/descript-audio-codec

Author
  • Shubham Gupta 2023

Summary

Classes:

DAC

Discrete Autoencoder Codec (DAC) for audio data encoding and decoding.

Decoder

A PyTorch module for the Decoder part of DAC.

DecoderBlock

A PyTorch module representing a block within the Decoder architecture.

Encoder

A PyTorch module for the Encoder part of DAC.

EncoderBlock

An encoder block module for convolutional neural networks.

ResidualUnit

A residual unit module for convolutional neural networks.

ResidualVectorQuantize

Residual vector quantizer, introduced in SoundStream: An End-to-End Neural Audio Codec (https://arxiv.org/abs/2107.03312).

Snake1d

A PyTorch module implementing the Snake activation function in 1D.

VectorQuantize

An implementation of Vector Quantization.

Functions:

WNConv1d

Apply weight normalization to a 1D convolutional layer.

WNConvTranspose1d

Apply weight normalization to a 1D transposed convolutional layer.

download

Downloads a specified model file based on model type, bitrate, and tag, saving it to a local path.

init_weights

Initialize the weights of a 1D convolutional layer.

Reference

speechbrain.lobes.models.discrete.dac.WNConv1d(*args, **kwargs)[source]

Apply weight normalization to a 1D convolutional layer.

Parameters:
  • *args – Variable length argument list for nn.Conv1d.

  • **kwargs – Arbitrary keyword arguments for nn.Conv1d.

Returns:

The weight-normalized nn.Conv1d layer.

Return type:

torch.nn.Module
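A minimal usage sketch (all arguments are forwarded to nn.Conv1d; the shapes below are illustrative):

>>> conv = WNConv1d(1, 64, kernel_size=7, padding=3)
>>> x = torch.randn(1, 1, 44100)  # [Batch, Channels, Time]
>>> y = conv(x)                   # [1, 64, 44100]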

speechbrain.lobes.models.discrete.dac.WNConvTranspose1d(*args, **kwargs)[source]

Apply weight normalization to a 1D transposed convolutional layer.

Parameters:
  • *args – Variable length argument list for nn.ConvTranspose1d.

  • **kwargs – Arbitrary keyword arguments for nn.ConvTranspose1d.

Returns:

The weight-normalized nn.ConvTranspose1d layer.

Return type:

torch.nn.Module
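A minimal usage sketch mirroring the convolutional case (arguments are forwarded to nn.ConvTranspose1d):

>>> tconv = WNConvTranspose1d(64, 1, kernel_size=4, stride=2, padding=1)
>>> y = torch.randn(1, 64, 100)  # [Batch, Channels, Time]
>>> x_hat = tconv(y)             # [1, 1, 200], time upsampled by the stride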

speechbrain.lobes.models.discrete.dac.init_weights(m)[source]

Initialize the weights of a 1D convolutional layer.

Parameters:

m (torch.nn.Module) – The module whose 1D convolutional weights are initialized.
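A typical use is via torch.nn.Module.apply, which visits every submodule; a minimal sketch:

>>> encoder = Encoder()
>>> _ = encoder.apply(init_weights)  # re-initializes the 1D convolutional layers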

speechbrain.lobes.models.discrete.dac.download(model_type: str = '44khz', model_bitrate: str = '8kbps', tag: str = 'latest', local_path: Path | None = None)[source]

Downloads a specified model file based on model type, bitrate, and tag, saving it to a local path.

Parameters:
  • model_type (str, optional) – The type of model to download. Can be ‘44khz’, ‘24khz’, or ‘16khz’. Default is ‘44khz’.

  • model_bitrate (str, optional) – The bitrate of the model. Can be ‘8kbps’ or ‘16kbps’. Default is ‘8kbps’.

  • tag (str, optional) – A specific version tag for the model. Default is ‘latest’.

  • local_path (Path, optional) – The local file path where the model will be saved. If not provided, a default path will be used.

Returns:

The local path where the model is saved.

Return type:

Path

Raises:

ValueError – If the model type or bitrate is not supported, or if the model cannot be found or downloaded.
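A minimal usage sketch (requires network access; when local_path is omitted, a default cache path is used and returned):

>>> model_path = download(model_type="44khz", model_bitrate="8kbps", tag="latest")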

class speechbrain.lobes.models.discrete.dac.VectorQuantize(input_dim: int, codebook_size: int, codebook_dim: int)[source]

Bases: Module

An implementation of Vector Quantization.

__init__(input_dim: int, codebook_size: int, codebook_dim: int)[source]

Implementation of VQ similar to Karpathy’s repo: https://github.com/karpathy/deep-vector-quantization. Additionally uses the following tricks from Improved VQGAN (https://arxiv.org/pdf/2110.04627.pdf):

  1. Factorized codes: perform the nearest-neighbor lookup in a low-dimensional space for improved codebook usage.

  2. L2-normalized codes: converts Euclidean distance to cosine similarity, which improves training stability.

forward(z: Tensor)[source]

Quantizes the input tensor using a fixed codebook and returns the corresponding codebook vectors.

Parameters:

z (Tensor[B x D x T]) – The continuous representation to quantize.

Returns:

  • Tensor[B x D x T] – Quantized continuous representation of input

  • Tensor[1] – Commitment loss to train encoder to predict vectors closer to codebook entries

  • Tensor[1] – Codebook loss to update the codebook

  • Tensor[B x T] – Codebook indices (quantized discrete representation of input)

  • Tensor[B x D x T] – Projected latents (continuous representation of input before quantization)
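A minimal sketch of the forward pass, unpacking the five documented return values:

>>> vq = VectorQuantize(input_dim=512, codebook_size=1024, codebook_dim=8)
>>> z = torch.randn(1, 512, 100)  # [Batch, Dim, Time]
>>> z_q, commit_loss, cb_loss, indices, latents = vq(z)
>>> indices.shape  # one code index per time step
torch.Size([1, 100])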

embed_code(embed_id: Tensor)[source]

Embeds an ID using the codebook weights.

This method utilizes the codebook weights to embed the given ID.

Parameters:

embed_id (torch.Tensor) – The tensor containing IDs that need to be embedded.

Returns:

The embedded output tensor after applying the codebook weights.

Return type:

torch.Tensor

decode_code(embed_id: Tensor)[source]

Decodes the embedded ID by transposing the dimensions.

This method decodes the embedded ID by applying a transpose operation to the dimensions of the output tensor from the embed_code method.

Parameters:

embed_id (torch.Tensor) – The tensor containing embedded IDs.

Returns:

The decoded tensor

Return type:

torch.Tensor
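A sketch of mapping code indices back to continuous codebook vectors (the index values are assumed to lie in [0, codebook_size)):

>>> vq = VectorQuantize(input_dim=512, codebook_size=1024, codebook_dim=8)
>>> ids = torch.randint(0, 1024, (1, 100))  # [Batch, Time]
>>> z_p = vq.decode_code(ids)               # [Batch, codebook_dim, Time]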

decode_latents(latents: Tensor)[source]

Decodes latent representations into discrete codes by comparing with the codebook.

Parameters:

latents (torch.Tensor) – The latent tensor representations to be decoded.

Returns:

A tuple containing the decoded latent tensor (z_q) and the indices of the codes.

Return type:

Tuple[torch.Tensor, torch.Tensor]
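A minimal sketch, assuming the latents have the codebook dimensionality:

>>> vq = VectorQuantize(input_dim=512, codebook_size=1024, codebook_dim=8)
>>> latents = torch.randn(1, 8, 100)  # [Batch, codebook_dim, Time]
>>> z_q, indices = vq.decode_latents(latents)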

training: bool
class speechbrain.lobes.models.discrete.dac.ResidualVectorQuantize(input_dim: int = 512, n_codebooks: int = 9, codebook_size: int = 1024, codebook_dim: int | list = 8, quantizer_dropout: float = 0.0)[source]

Bases: Module

Residual vector quantizer, introduced in SoundStream: An End-to-End Neural Audio Codec (https://arxiv.org/abs/2107.03312).

Example

Using a pretrained RVQ unit.

>>> dac = DAC(load_pretrained=True, model_type="44khz", model_bitrate="8kbps", tag="latest")
>>> quantizer = dac.quantizer
>>> continuous_embeddings = torch.randn(1, 1024, 100) # Example shape: [Batch, Channels, Time]
>>> discrete_embeddings, codes, _, _, _ = quantizer(continuous_embeddings)
__init__(input_dim: int = 512, n_codebooks: int = 9, codebook_size: int = 1024, codebook_dim: int | list = 8, quantizer_dropout: float = 0.0)[source]

Initializes the ResidualVectorQuantize

Parameters:
  • input_dim (int, optional) – Dimensionality of the input features. Default is 512.

  • n_codebooks (int, optional) – Number of codebooks. Default is 9.

  • codebook_size (int, optional) – Size of each codebook. Default is 1024.

  • codebook_dim (Union[int, list], optional) – Dimensionality of each codebook entry. Default is 8.

  • quantizer_dropout (float, optional) – Dropout rate applied to the quantizers during training. Default is 0.0.

forward(z, n_quantizers: int | None = None)[source]

Quantizes the input tensor using a fixed set of n codebooks and returns the corresponding codebook vectors.

Parameters:
  • z (Tensor[B x D x T]) – The continuous representation to quantize.

  • n_quantizers (int, optional) – Number of quantizers to use (n_quantizers < self.n_codebooks, e.g. for quantizer dropout). Note: if self.quantizer_dropout is True, this argument is ignored in training mode and a random number of quantizers is used.

Returns:

  • z (Tensor[B x D x T]) – Quantized continuous representation of input

  • codes (Tensor[B x N x T]) – Codebook indices for each codebook (quantized discrete representation of input)

  • latents (Tensor[B x N*D x T]) – Projected latents (continuous representation of input before quantization)

  • vq/commitment_loss (Tensor[1]) – Commitment loss to train encoder to predict vectors closer to codebook entries

  • vq/codebook_loss (Tensor[1]) – Codebook loss to update the codebook

from_codes(codes: Tensor)[source]

Given the quantized codes, reconstruct the continuous representation.

Parameters:

codes (Tensor[B x N x T]) – Quantized discrete representation of input

Returns:

Quantized continuous representation of input

Return type:

Tensor[B x D x T]

from_latents(latents: Tensor)[source]

Given the unquantized latents, reconstruct the continuous representation after quantization.

Parameters:

latents (Tensor[B x N x T]) – Continuous representation of input after projection

Returns:

  • Tensor[B x D x T] – Quantized representation of full-projected space

  • Tensor[B x D x T] – Quantized representation of latent space

training: bool
class speechbrain.lobes.models.discrete.dac.Snake1d(channels)[source]

Bases: Module

A PyTorch module implementing the Snake activation function in 1D.

Parameters:

channels (int) – The number of channels in the input tensor.

__init__(channels)[source]

Initializes Snake1d

Parameters:

channels (int) – The number of channels in the input tensor.

forward(x)[source]
Parameters:

x (torch.Tensor) – The input tensor.

Returns:

The tensor with the Snake activation applied, with the same shape as the input.

Return type:

torch.Tensor
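For reference, the Snake activation is defined as x + (1/α) · sin²(αx) with a learnable per-channel α; a minimal, shape-preserving usage sketch:

>>> snake = Snake1d(channels=16)
>>> x = torch.randn(1, 16, 100)  # [Batch, Channels, Time]
>>> y = snake(x)                 # same shape as x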

training: bool
class speechbrain.lobes.models.discrete.dac.ResidualUnit(dim: int = 16, dilation: int = 1)[source]

Bases: Module

A residual unit module for convolutional neural networks.

Parameters:
  • dim (int, optional) – The number of channels in the input tensor. Default is 16.

  • dilation (int, optional) – The dilation rate for the convolutional layers. Default is 1.

__init__(dim: int = 16, dilation: int = 1)[source]

Initializes the ResidualUnit

Parameters:
  • dim (int, optional, by default 16) –

  • dilation (int, optional, by default 1) –

forward(x: Tensor) → Tensor[source]
Parameters:

x (torch.Tensor) –

Return type:

torch.Tensor
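A minimal sketch; the residual connection means the output keeps the input shape:

>>> unit = ResidualUnit(dim=16, dilation=3)
>>> x = torch.randn(1, 16, 100)  # [Batch, Channels, Time]
>>> y = unit(x)                  # [1, 16, 100]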

training: bool
class speechbrain.lobes.models.discrete.dac.EncoderBlock(dim: int = 16, stride: int = 1)[source]

Bases: Module

An encoder block module for convolutional neural networks.

This module constructs an encoder block consisting of a series of ResidualUnits and a final Snake1d activation followed by a weighted normalized 1D convolution. This block can be used as part of an encoder in architectures like autoencoders.

Parameters:
  • dim (int, optional) – The number of output channels. Default is 16.

  • stride (int, optional) – The stride for the final convolutional layer. Default is 1.

__init__(dim: int = 16, stride: int = 1)[source]

Initializes the EncoderBlock

Parameters:
  • dim (int, optional, by default 16) –

  • stride (int, optional, by default 1) –

forward(x: Tensor)[source]
Parameters:

x (torch.Tensor) –

Return type:

torch.Tensor
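A minimal sketch, assuming (as in the reference Descript implementation) that the block maps dim // 2 input channels to dim output channels while downsampling time by stride:

>>> block = EncoderBlock(dim=32, stride=2)
>>> x = torch.randn(1, 16, 100)  # [Batch, dim // 2, Time]
>>> y = block(x)                 # expected [1, 32, 50] under these assumptions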

training: bool
class speechbrain.lobes.models.discrete.dac.Encoder(d_model: int = 64, strides: list = [2, 4, 8, 8], d_latent: int = 64)[source]

Bases: Module

A PyTorch module for the Encoder part of DAC.

Parameters:
  • d_model (int, optional) – The initial dimensionality of the model. Default is 64.

  • strides (list, optional) – A list of stride values for downsampling in each EncoderBlock. Default is [2, 4, 8, 8].

  • d_latent (int, optional) – The dimensionality of the output latent space. Default is 64.

Example

Creating an Encoder instance.

>>> encoder = Encoder()
>>> audio_input = torch.randn(1, 1, 44100) # Example shape: [Batch, Channels, Time]
>>> continuous_embedding = encoder(audio_input)

Using a pretrained encoder.

>>> dac = DAC(load_pretrained=True, model_type="44khz", model_bitrate="8kbps", tag="latest")
>>> encoder = dac.encoder
>>> audio_input = torch.randn(1, 1, 44100) # Example shape: [Batch, Channels, Time]
>>> continuous_embeddings = encoder(audio_input)
__init__(d_model: int = 64, strides: list = [2, 4, 8, 8], d_latent: int = 64)[source]

Initializes the Encoder

Parameters:
  • d_model (int, optional, by default 64) –

  • strides (list, optional, by default [2, 4, 8, 8]) –

  • d_latent (int, optional, by default 64) –

forward(x)[source]
Parameters:

x (torch.Tensor) –

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.discrete.dac.DecoderBlock(input_dim: int = 16, output_dim: int = 8, stride: int = 1)[source]

Bases: Module

A PyTorch module representing a block within the Decoder architecture.

Parameters:
  • input_dim (int, optional) – The number of input channels. Default is 16.

  • output_dim (int, optional) – The number of output channels. Default is 8.

  • stride (int, optional) – The stride for the transposed convolution, controlling the upsampling. Default is 1.

__init__(input_dim: int = 16, output_dim: int = 8, stride: int = 1)[source]

Initializes the DecoderBlock

Parameters:
  • input_dim (int, optional, by default 16) –

  • output_dim (int, optional, by default 8) –

  • stride (int, optional, by default 1) –

forward(x)[source]
Parameters:

x (torch.Tensor) –

Return type:

torch.Tensor
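A minimal sketch; the transposed convolution upsamples time by roughly the stride factor:

>>> block = DecoderBlock(input_dim=16, output_dim=8, stride=2)
>>> x = torch.randn(1, 16, 100)  # [Batch, input_dim, Time]
>>> y = block(x)                 # expected [1, 8, 200] under these assumptions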

training: bool
class speechbrain.lobes.models.discrete.dac.Decoder(input_channel: int, channels: int, rates: List[int], d_out: int = 1)[source]

Bases: Module

A PyTorch module for the Decoder part of DAC.

Parameters:
  • input_channel (int) – The number of channels in the input tensor.

  • channels (int) – The base number of channels for the convolutional layers.

  • rates (list) – A list of stride rates for each decoder block

  • d_out (int, optional) – The output dimension of the final convolutional layer. Default is 1.

Example

Creating a Decoder instance

>>> decoder = Decoder(256, 1536, [8, 8, 4, 2])
>>> discrete_embeddings = torch.randn(2, 256, 200) # Example shape: [Batch, Channels, Time]
>>> recovered_audio = decoder(discrete_embeddings)

Using a pretrained decoder. Note that the actual input should be a proper discrete representation; randomly generated input is used here only to illustrate usage.

>>> dac = DAC(load_pretrained=True, model_type="44khz", model_bitrate="8kbps", tag="latest")
>>> decoder = dac.decoder
>>> discrete_embeddings = torch.randn(1, 1024, 500) # Example shape: [Batch, Channels, Time]
>>> recovered_audio = decoder(discrete_embeddings)
__init__(input_channel: int, channels: int, rates: List[int], d_out: int = 1)[source]

Initializes Decoder

Parameters:
  • input_channel (int) –

  • channels (int) –

  • rates (List[int]) –

  • d_out (int, optional, by default 1) –

forward(x)[source]
Parameters:

x (torch.Tensor) –

Return type:

torch.Tensor

training: bool
class speechbrain.lobes.models.discrete.dac.DAC(encoder_dim: int = 64, encoder_rates: List[int] = [2, 4, 8, 8], latent_dim: int | None = None, decoder_dim: int = 1536, decoder_rates: List[int] = [8, 8, 4, 2], n_codebooks: int = 9, codebook_size: int = 1024, codebook_dim: int | list = 8, quantizer_dropout: bool = False, sample_rate: int = 44100, model_type: str = '44khz', model_bitrate: str = '8kbps', tag: str = 'latest', load_path: str | None = None, strict: bool = False, load_pretrained: bool = False)[source]

Bases: Module

Discrete Autoencoder Codec (DAC) for audio data encoding and decoding.

This class implements an autoencoder architecture with quantization for efficient audio processing. It includes an encoder, quantizer, and decoder for transforming audio data into a compressed latent representation and reconstructing it back into audio. This implementation supports both initializing a new model and loading a pretrained model.

Parameters:
  • encoder_dim (int) – Dimensionality of the encoder.

  • encoder_rates (List[int]) – Downsampling rates for each encoder layer.

  • latent_dim (int, optional) – Dimensionality of the latent space, automatically calculated if None.

  • decoder_dim (int) – Dimensionality of the decoder.

  • decoder_rates (List[int]) – Upsampling rates for each decoder layer.

  • n_codebooks (int) – Number of codebooks for vector quantization.

  • codebook_size (int) – Size of each codebook.

  • codebook_dim (Union[int, list]) – Dimensionality of each codebook entry.

  • quantizer_dropout (bool) – Whether to use dropout in the quantizer.

  • sample_rate (int) – Sample rate of the audio data.

  • model_type (str) – Type of the model to load (if pretrained).

  • model_bitrate (str) – Bitrate of the model to load (if pretrained).

  • tag (str) – Specific tag of the model to load (if pretrained).

  • load_path (str, optional) – Path to load the pretrained model from, automatically downloaded if None.

  • strict (bool) – Whether to strictly enforce the state dictionary match.

  • load_pretrained (bool) – Whether to load a pretrained model.

Example

Creating a new DAC instance:

>>> dac = DAC()
>>> audio_data = torch.randn(1, 1, 16000) # Example shape: [Batch, Channels, Time]
>>> tokens, embeddings = dac(audio_data)

Loading a pretrained DAC instance:

>>> dac = DAC(load_pretrained=True, model_type="44khz", model_bitrate="8kbps", tag="latest")
>>> audio_data = torch.randn(1, 1, 16000) # Example shape: [Batch, Channels, Time]
>>> tokens, embeddings = dac(audio_data)

The tokens and the discrete embeddings obtained above or from other sources can be decoded:

>>> dac = DAC(load_pretrained=True, model_type="44khz", model_bitrate="8kbps", tag="latest")
>>> audio_data = torch.randn(1, 1, 16000) # Example shape: [Batch, Channels, Time]
>>> tokens, embeddings = dac(audio_data)
>>> decoded_audio = dac.decode(embeddings)
__init__(encoder_dim: int = 64, encoder_rates: List[int] = [2, 4, 8, 8], latent_dim: int | None = None, decoder_dim: int = 1536, decoder_rates: List[int] = [8, 8, 4, 2], n_codebooks: int = 9, codebook_size: int = 1024, codebook_dim: int | list = 8, quantizer_dropout: bool = False, sample_rate: int = 44100, model_type: str = '44khz', model_bitrate: str = '8kbps', tag: str = 'latest', load_path: str | None = None, strict: bool = False, load_pretrained: bool = False)[source]

Initializes DAC

Parameters:
  • encoder_dim (int, optional, by default 64) –

  • encoder_rates (List[int], optional, by default [2, 4, 8, 8]) –

  • latent_dim (int, optional, by default None) –

  • decoder_dim (int, optional, by default 1536) –

  • decoder_rates (List[int], optional, by default [8, 8, 4, 2]) –

  • n_codebooks (int, optional, by default 9) –

  • codebook_size (int, optional, by default 1024) –

  • codebook_dim (Union[int, list], optional, by default 8) –

  • quantizer_dropout (bool, optional, by default False) –

  • sample_rate (int, optional, by default 44100) –

  • model_type (str, optional, by default "44khz") –

  • model_bitrate (str, optional, by default "8kbps") –

  • tag (str, optional, by default "latest") –

  • load_path (str, optional, by default None) –

  • strict (bool, optional, by default False) –

  • load_pretrained (bool, optional) – If True, then a pretrained model is loaded, by default False

encode(audio_data: Tensor, n_quantizers: int | None = None)[source]

Encode given audio data and return quantized latent codes

Parameters:
  • audio_data (Tensor[B x 1 x T]) – Audio data to encode

  • n_quantizers (int, optional) – Number of quantizers to use, by default None. If None, all quantizers are used.

Returns:

  • "z" (Tensor[B x D x T]) – Quantized continuous representation of input

  • "codes" (Tensor[B x N x T]) – Codebook indices for each codebook (quantized discrete representation of input)

  • "latents" (Tensor[B x N*D x T]) – Projected latents (continuous representation of input before quantization)

  • "vq/commitment_loss" (Tensor[1]) – Commitment loss to train encoder to predict vectors closer to codebook entries

  • "vq/codebook_loss" (Tensor[1]) – Codebook loss to update the codebook

  • "length" (int) – Number of samples in input audio

decode(z: Tensor)[source]

Decode given latent codes and return audio data

Parameters:

z (Tensor[B x D x T]) – Quantized continuous representation of input

Returns:

Decoded audio data.

Return type:

Tensor[B x 1 x length]

training: bool
forward(audio_data: Tensor, sample_rate: int | None = None, n_quantizers: int | None = None)[source]

Model forward pass

Parameters:
  • audio_data (Tensor[B x 1 x T]) – Audio data to encode

  • sample_rate (int, optional) – Sample rate of the audio data in Hz, by default None. If None, defaults to self.sample_rate.

  • n_quantizers (int, optional) – Number of quantizers to use, by default None. If None, all quantizers are used.

Returns:

  • "tokens" (Tensor[B x N x T]) – Codebook indices for each codebook (quantized discrete representation of input)

  • "embeddings" (Tensor[B x D x T]) – Quantized continuous representation of input