speechbrain.lobes.models.huggingface_transformers.encodec module

This lobe enables the integration of huggingface pretrained EnCodec.

EnCodec makes it possible to compress audio into a sequence of discrete tokens at a choice of bandwidths, and to reconstruct audio from such sequences, with some loss of quality depending on the bandwidth.

Note that while EnCodec can be used to reconstruct speech data, for a high-quality reconstruction it is recommended to use a specially trained vocoder, such as Vocos (speechbrain.lobes.models.huggingface_transformers.vocos).

Repository: https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/encodec
Paper: https://arxiv.org/abs/2210.13438

Authors
  • Artem Ploujnikov 2023

Summary

Classes:

Encodec

A wrapper for the HuggingFace EnCodec model

Reference

class speechbrain.lobes.models.huggingface_transformers.encodec.Encodec(source, save_path=None, sample_rate=None, bandwidth=1.5, flat_embeddings=False, freeze=True, renorm_embeddings=True)[source]

Bases: HFTransformersInterface

A wrapper for the HuggingFace EnCodec model

Parameters:
  • source (str) – A HuggingFace repository identifier or a path

  • save_path (str) – The location where the pretrained model will be saved

  • sample_rate (int) – The audio sampling rate

  • bandwidth (float) – The encoding bandwidth, in kbps (optional). Supported bandwidths: 1.5, 3.0, 6.0, 12.0, 24.0

  • flat_embeddings (bool) – If set to True, embeddings will be flattened into (Batch x Length x (Heads * Embedding))

  • freeze (bool) – whether the model will be frozen (i.e. not trainable if used as part of training another model)

  • renorm_embeddings (bool) – whether embeddings should be renormalized, as in the original model

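The bandwidth setting determines how many quantizer codebooks (the "Heads" dimension in the shapes below) are active. As a rough sketch, assuming the 24 kHz EnCodec configuration described in the paper (75 frames per second, 1024-entry codebooks, i.e. 10 bits per codebook), the codebook count can be derived as follows; the actual values are determined by the underlying HuggingFace model, not this snippet:

```python
# Sketch: relate bandwidth (kbps) to the active codebook count, assuming
# the 24 kHz EnCodec configuration: 75 frames/s, 10 bits per codebook.
FRAME_RATE_HZ = 75
BITS_PER_CODEBOOK = 10  # log2(1024 codebook entries)

def num_codebooks(bandwidth_kbps: float) -> int:
    bits_per_second = bandwidth_kbps * 1000
    bits_per_frame = bits_per_second / FRAME_RATE_HZ
    return int(bits_per_frame // BITS_PER_CODEBOOK)

for bw in (1.5, 3.0, 6.0, 12.0, 24.0):
    print(bw, num_codebooks(bw))  # 1.5 kbps -> 2 codebooks, 24.0 kbps -> 32
```

At the default bandwidth of 1.5 kbps this gives 2 codebooks, which matches the Heads dimension of 2 in the doctest example below.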
Example

>>> import torch
>>> model_hub = "facebook/encodec_24khz"
>>> save_path = "savedir"
>>> model = Encodec(model_hub, save_path)
>>> audio = torch.randn(4, 1000)
>>> length = torch.tensor([1.0, .5, .75, 1.0])
>>> tokens, emb = model.encode(audio, length)
>>> tokens.shape
torch.Size([4, 4, 2])
>>> emb.shape
torch.Size([4, 4, 2, 128])
>>> rec = model.decode(tokens, length)
>>> rec.shape
torch.Size([4, 1, 1280])
>>> rec_emb = model.decode_emb(emb, length)
>>> rec_emb.shape
torch.Size([4, 1, 1280])
>>> rec_tokens = model.tokens(emb, length)
>>> rec_tokens.shape
torch.Size([4, 4, 2])
>>> model = Encodec(model_hub, save_path, flat_embeddings=True)
>>> _, emb = model.encode(audio, length)
>>> emb.shape
torch.Size([4, 4, 256])
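
The flat_embeddings option merges the heads and embedding dimensions; it is equivalent to a reshape like the following sketch (dimension sizes taken from the doctest example above):

```python
import torch

# Hypothetical stand-in for the raw quantizer output:
# (Batch x Length x Heads x Embedding)
emb = torch.randn(4, 4, 2, 128)

# Flattening merges the last two axes into (Batch x Length x (Heads * Embedding))
flat = emb.reshape(emb.shape[0], emb.shape[1], -1)
print(flat.shape)  # torch.Size([4, 4, 256])
```
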
calibrate(sample, length)[source]

Calibrates the normalization on a sound sample

Parameters:
  • sample (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) audio sample

  • length (torch.Tensor) – A tensor of relative lengths

Returns:

  • emb_mean (torch.Tensor) – The embedding mean

  • emb_std (torch.Tensor) – The embedding standard deviation
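
Conceptually, calibration computes statistics over only the valid (unpadded) region of each item in the batch, as indicated by the relative lengths. A minimal, hypothetical sketch of such a masked mean/std (not the wrapper's actual implementation):

```python
import torch

def masked_stats(emb: torch.Tensor, length: torch.Tensor):
    """Mean/std over valid positions only.

    emb: (Batch x Length x Embedding) embeddings
    length: (Batch,) relative lengths in [0, 1]
    """
    max_len = emb.shape[1]
    # Build a boolean mask of valid positions from the relative lengths
    positions = torch.arange(max_len).unsqueeze(0)       # (1 x Length)
    mask = positions < (length.unsqueeze(1) * max_len)   # (Batch x Length)
    valid = emb[mask]  # all valid vectors, flattened over batch and time
    return valid.mean(), valid.std()

emb = torch.randn(4, 10, 128)
length = torch.tensor([1.0, 0.5, 0.75, 1.0])
emb_mean, emb_std = masked_stats(emb, length)
```
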

forward(inputs, length)[source]

Encodes the input audio as tokens

Parameters:
  • inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio

  • length (torch.Tensor) – A tensor of relative lengths

Returns:

tokens – A (Batch x Tokens) tensor of audio tokens

Return type:

torch.Tensor
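
The length argument used throughout this class follows the usual SpeechBrain convention: relative lengths in [0, 1], where each entry gives the fraction of the padded time axis that contains real audio. A small illustration:

```python
import torch

audio = torch.randn(4, 1000)  # (Batch x Samples), padded to 1000 samples
length = torch.tensor([1.0, 0.5, 0.75, 1.0])

# Absolute number of valid samples per batch item
abs_lengths = (length * audio.shape[1]).long()
print(abs_lengths.tolist())  # [1000, 500, 750, 1000]
```
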

encode(inputs, length)[source]

Encodes the input audio as tokens and embeddings

Parameters:
  • inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio

  • length (torch.Tensor) – A tensor of relative lengths

Returns:

  • tokens (torch.Tensor) – A (Batch x Tokens x Heads) tensor of audio tokens

  • emb (torch.Tensor) – Raw vector embeddings from the model’s quantizers

embeddings(tokens)[source]

Converts token indexes to vector embeddings

Parameters:

tokens (torch.Tensor) – a (Batch x Length x Heads) tensor of token indexes

Returns:

emb – a (Batch x Length x Heads x Embedding) tensor of raw vector embeddings from the model’s quantizer codebooks

Return type:

torch.Tensor
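
Conceptually, this is a per-head codebook lookup: each head's token index selects a row from that head's codebook. A hypothetical sketch with random codebooks (EnCodec quantizers use 1024-entry codebooks; the other sizes below are taken from the doctest example):

```python
import torch

num_heads, codebook_size, emb_dim = 2, 1024, 128
# Hypothetical per-head codebooks: (Heads x CodebookSize x Embedding)
codebooks = torch.randn(num_heads, codebook_size, emb_dim)

# (Batch x Length x Heads) token indexes
tokens = torch.randint(0, codebook_size, (4, 4, num_heads))

# Look up each head's tokens in that head's codebook
emb = torch.stack(
    [codebooks[h, tokens[..., h]] for h in range(num_heads)], dim=2
)
print(emb.shape)  # torch.Size([4, 4, 2, 128])
```
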

decode(tokens, length=None)[source]

Decodes audio from tokens

Parameters:
  • tokens (torch.Tensor) – A (Batch x Length x Heads) tensor of audio tokens

  • length (torch.Tensor) – A 1-D tensor of relative lengths

Returns:

audio – the reconstructed audio

Return type:

torch.Tensor

tokens(emb, length=None)[source]

Converts embeddings to raw tokens

Parameters:
  • emb (torch.Tensor) – Raw embeddings

  • length (torch.Tensor) – A 1-D tensor of relative lengths. If supplied, padded positions will be zeroed out

Returns:

tokens – A (Batch x Length x Heads) tensor of token indices

Return type:

torch.Tensor
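
Conceptually, this maps each embedding vector back to the index of its nearest codebook entry. A hypothetical nearest-neighbour sketch for a single head (not the wrapper's actual implementation):

```python
import torch

codebook_size, emb_dim = 1024, 128
codebook = torch.randn(codebook_size, emb_dim)  # one head's codebook

# Embeddings known to be exact codebook rows, for illustration
emb = codebook[torch.tensor([3, 17, 250])]

# Nearest codebook entry by Euclidean distance
dists = torch.cdist(emb, codebook)  # (N x CodebookSize)
tokens = dists.argmin(dim=-1)
print(tokens.tolist())  # [3, 17, 250]
```
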

decode_emb(emb, length)[source]

Decodes raw vector embeddings into audio

Parameters:
  • emb (torch.Tensor) – A (Batch x Length x Heads x Embedding) tensor of raw vector embeddings

  • length (torch.Tensor) – The corresponding lengths of the inputs.

Returns:

audio – the reconstructed audio

Return type:

torch.Tensor