speechbrain.lobes.models.huggingface_transformers.encodec module
This lobe enables the integration of huggingface pretrained EnCodec.
EnCodec makes it possible to compress audio into a sequence of discrete tokens at different bandwidths, and to reconstruct audio from such sequences, with some loss of quality depending on the bandwidth.
Note that while EnCodec can be used to reconstruct speech data, a specially trained vocoder, such as Vocos (speechbrain.lobes.models.huggingface_transformers.vocos), is recommended for high-quality reconstruction.
Repository: https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/encodec
Paper: https://arxiv.org/abs/2210.13438
- Authors
Artem Ploujnikov 2023
Summary
Classes:
A wrapper for the HuggingFace EnCodec model
Reference
- class speechbrain.lobes.models.huggingface_transformers.encodec.Encodec(source, save_path=None, sample_rate=None, bandwidth=1.5, flat_embeddings=False, freeze=True, renorm_embeddings=True)[source]
Bases:
HFTransformersInterface
A wrapper for the HuggingFace EnCodec model
- Parameters:
source (str) – A HuggingFace repository identifier or a path
save_path (str) – The location where the pretrained model will be saved
sample_rate (int) – The audio sampling rate
bandwidth (float) – The encoding bandwidth, in kbps (optional). Supported bandwidths: 1.5, 3.0, 6.0, 12.0, 24.0
flat_embeddings (bool) – If set to True, embeddings will be flattened into (Batch x Length x (Heads * Embedding))
freeze (bool) – whether the model will be frozen (i.e. not trainable when used as part of training another model)
renorm_embeddings (bool) – whether embeddings should be renormalized, as in the original model
Example
>>> model_hub = "facebook/encodec_24khz"
>>> save_path = "savedir"
>>> model = Encodec(model_hub, save_path)
>>> audio = torch.randn(4, 1000)
>>> length = torch.tensor([1.0, .5, .75, 1.0])
>>> tokens, emb = model.encode(audio, length)
>>> tokens.shape
torch.Size([4, 4, 2])
>>> emb.shape
torch.Size([4, 4, 2, 128])
>>> rec = model.decode(tokens, length)
>>> rec.shape
torch.Size([4, 1, 1280])
>>> rec_emb = model.decode_emb(emb, length)
>>> rec_emb.shape
torch.Size([4, 1, 1280])
>>> rec_tokens = model.tokens(emb, length)
>>> rec_tokens.shape
torch.Size([4, 4, 2])
>>> model = Encodec(model_hub, save_path, flat_embeddings=True)
>>> _, emb = model.encode(audio, length)
>>> emb.shape
torch.Size([4, 4, 256])
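The flat_embeddings option corresponds to a simple reshape of the quantizer output, merging the head and embedding dimensions. A minimal sketch of that shape transformation, using toy tensors rather than the actual model:

```python
import torch

# Toy dimensions matching the example above: 4 batch items, 4 time
# steps, 2 quantizer heads, 128-dimensional codebook embeddings.
batch, length, heads, emb_dim = 4, 4, 2, 128

# Shape of encode() output with flat_embeddings=False
emb = torch.randn(batch, length, heads, emb_dim)

# With flat_embeddings=True, the head and embedding dimensions are
# merged into one: (Batch x Length x (Heads * Embedding))
flat = emb.reshape(batch, length, heads * emb_dim)
print(flat.shape)  # torch.Size([4, 4, 256])
```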
- calibrate(sample, length)[source]
Calibrates the normalization on a sound sample
- Parameters:
sample (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) audio sample
length (torch.Tensor) – A tensor of relative lengths
- Returns:
emb_mean (torch.Tensor) – The embedding mean
emb_std (torch.Tensor) – The embedding standard deviation
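Statistics like those returned by calibrate can drive the renormalization that renorm_embeddings enables. A hedged sketch of how such statistics might be applied (plain standardization on toy data; not the model's exact internals):

```python
import torch

# Toy embeddings: (Batch x Length x Heads x Embedding)
emb = torch.randn(4, 10, 2, 128)

# Per-dimension statistics, of the kind calibrate() would estimate
# from a sound sample (here computed on the toy data itself)
emb_mean = emb.mean(dim=(0, 1, 2), keepdim=True)
emb_std = emb.std(dim=(0, 1, 2), keepdim=True)

# Standard renormalization: zero mean, unit variance per dimension
emb_norm = (emb - emb_mean) / emb_std
print(emb_norm.shape)  # torch.Size([4, 10, 2, 128])
```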
- forward(inputs, length)[source]
Encodes the input audio as tokens
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths
- Returns:
tokens – A (Batch x Tokens) tensor of audio tokens
- Return type:
torch.Tensor
- encode(inputs, length)[source]
Encodes the input audio as tokens and embeddings
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths
- Returns:
tokens (torch.Tensor) – A (Batch x Tokens x Heads) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model’s quantizers
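The length arguments throughout this API are relative lengths (fractions of the longest item in the batch), following the SpeechBrain convention. A small sketch of how relative lengths translate into a padding mask, a common pattern when zeroing out padded positions (toy values, not the model's code):

```python
import torch

# Relative lengths as in the example above
length = torch.tensor([1.0, 0.5, 0.75, 1.0])
max_len = 4  # number of time steps after encoding

# Convert to absolute lengths in steps
abs_len = (length * max_len).round().long()

# Binary mask: True for valid positions, False for padding
mask = torch.arange(max_len)[None, :] < abs_len[:, None]
print(mask.shape)  # torch.Size([4, 4])
```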
- embeddings(tokens)[source]
Converts token indexes to vector embeddings
- Parameters:
tokens (torch.Tensor) – a (Batch x Length x Heads) tensor of token indexes
- Returns:
emb – a (Batch x Length x Heads x Embedding) tensor of raw vector embeddings from the model’s quantizer codebooks
- Return type:
torch.Tensor
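Conceptually, converting token indexes to embeddings is a per-head codebook lookup. A toy illustration with random stand-in codebooks (the real model uses the quantizer codebooks learned by EnCodec; 1024 entries per codebook matches the 24 kHz model):

```python
import torch

batch, length, heads, emb_dim = 4, 4, 2, 128
codebook_size = 1024  # entries per quantizer codebook

# One codebook per quantizer head (random stand-ins here)
codebooks = torch.randn(heads, codebook_size, emb_dim)

# Token indexes: (Batch x Length x Heads)
tokens = torch.randint(0, codebook_size, (batch, length, heads))

# Look up each head's tokens in that head's codebook
emb = torch.stack(
    [codebooks[h][tokens[..., h]] for h in range(heads)], dim=2
)
print(emb.shape)  # torch.Size([4, 4, 2, 128])
```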
- decode(tokens, length=None)[source]
Decodes audio from tokens
- Parameters:
tokens (torch.Tensor) – A (Batch x Length x Heads) tensor of audio tokens
length (torch.Tensor) – A 1-D tensor of relative lengths
- Returns:
audio – the reconstructed audio
- Return type:
torch.Tensor
- tokens(emb, length=None)[source]
Converts embeddings to raw tokens
- Parameters:
emb (torch.Tensor) – Raw embeddings
length (torch.Tensor) – A 1-D tensor of relative lengths. If supplied, padded positions will be zeroed out
- Returns:
tokens – A (Batch x Length x Heads) tensor of token indices
- Return type:
torch.Tensor
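The inverse mapping, from embeddings back to token indexes, amounts to a nearest-neighbour search in each head's codebook. A toy sketch using squared Euclidean distance and random stand-in codebooks (not the model's implementation):

```python
import torch

batch, length, heads, emb_dim = 4, 4, 2, 128
codebook_size = 1024

codebooks = torch.randn(heads, codebook_size, emb_dim)
emb = torch.randn(batch, length, heads, emb_dim)

# Pairwise distances per head: (Heads x Batch*Length x Codebook)
dist = torch.cdist(
    emb.permute(2, 0, 1, 3).reshape(heads, -1, emb_dim),
    codebooks,
)

# Pick the closest codebook entry for every position and head
tokens = dist.argmin(dim=-1).reshape(heads, batch, length)
tokens = tokens.permute(1, 2, 0)  # (Batch x Length x Heads)
print(tokens.shape)  # torch.Size([4, 4, 2])
```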
- decode_emb(emb, length)[source]
Decodes raw vector embeddings into audio
- Parameters:
emb (torch.Tensor) – A (Batch x Length x Heads x Embedding) tensor of raw vector embeddings
length (torch.Tensor) – A tensor of relative lengths
- Returns:
audio – the reconstructed audio
- Return type:
torch.Tensor