speechbrain.lobes.models.huggingface_transformers.encodec module
This lobe enables the integration of huggingface pretrained EnCodec.
EnCodec makes it possible to compress audio into a sequence of discrete tokens at different bandwidths - and to reconstruct audio from such sequences, with some loss of quality depending on the bandwidth.
Note that while EnCodec can be used to reconstruct speech data, a specially trained vocoder such as Vocos (speechbrain.lobes.models.huggingface_transformers.vocos) is recommended for high-quality reconstruction.
Repository: https://huggingface.co/docs/transformers/v4.31.0/en/model_doc/encodec
Paper: https://arxiv.org/abs/2210.13438
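As a quick orientation, the following is a minimal sketch of a compress/reconstruct round trip at an explicit bandwidth. It assumes the facebook/encodec_24khz checkpoint and the "savedir" cache directory used in the class reference below; the full doctest appears under the Encodec class.

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.encodec import Encodec
>>> model = Encodec("facebook/encodec_24khz", save_path="savedir", bandwidth=6.0)
>>> audio = torch.randn(4, 24000)                  # (Batch x Samples) waveform
>>> length = torch.ones(4)                         # relative lengths (1.0 = full length)
>>> tokens, emb = model.encode(audio, length)      # discrete tokens and quantizer embeddings
>>> reconstruction = model.decode(tokens, length)  # lossy reconstruction of the waveform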
- Authors
Artem Ploujnikov 2023
Summary
Classes:
Encodec – A wrapper for the HuggingFace encodec model
Reference
- class speechbrain.lobes.models.huggingface_transformers.encodec.Encodec(source, save_path=None, sample_rate=None, bandwidth=1.5, flat_embeddings=False, freeze=True, renorm_embeddings=True)[source]
Bases:
HFTransformersInterface
A wrapper for the HuggingFace encodec model
- Parameters:
source (str) – A HuggingFace repository identifier or a path
save_path (str) – The location where the pretrained model will be saved
sample_rate (int) – The audio sampling rate
bandwidth (float) – The encoding bandwidth, in kbps (optional). Supported bandwidths: 1.5, 3.0, 6.0, 12.0, 24.0
flat_embeddings (bool) – If set to True, embeddings will be flattened into (Batch x Length x (Heads * Embedding))
freeze (bool) – Whether the model will be frozen (e.g. not trainable if used as part of training another model)
renorm_embeddings (bool) – Whether embeddings should be renormalized, as done in the original model
Example
>>> import torch
>>> model_hub = "facebook/encodec_24khz"
>>> save_path = "savedir"
>>> model = Encodec(model_hub, save_path)
>>> audio = torch.randn(4, 1000)
>>> length = torch.tensor([1.0, .5, .75, 1.0])
>>> tokens, emb = model.encode(audio, length)
>>> tokens.shape
torch.Size([4, 4, 2])
>>> emb.shape
torch.Size([4, 4, 2, 128])
>>> rec = model.decode(tokens, length)
>>> rec.shape
torch.Size([4, 1, 1280])
>>> rec_emb = model.decode_emb(emb, length)
>>> rec_emb.shape
torch.Size([4, 1, 1280])
>>> rec_tokens = model.tokens(emb, length)
>>> rec_tokens.shape
torch.Size([4, 4, 2])
>>> model = Encodec(model_hub, save_path, flat_embeddings=True)
>>> _, emb = model.encode(audio, length)
>>> emb.shape
torch.Size([4, 4, 256])
- calibrate(sample, length)[source]
Calibrates the normalization on a sound sample
- Parameters:
sample (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) audio sample
length (torch.Tensor) – A tensor of relative lengths
- Returns:
emb_mean (torch.Tensor) – The embedding mean
emb_std (torch.Tensor) – The embedding standard deviation
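A minimal usage sketch, assuming a model built with the default renorm_embeddings=True and a representative batch of audio to calibrate on (checkpoint and save path are illustrative):

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.encodec import Encodec
>>> model = Encodec("facebook/encodec_24khz", save_path="savedir")
>>> sample = torch.randn(8, 24000)   # representative (Batch x Samples) audio
>>> length = torch.ones(8)           # relative lengths
>>> emb_mean, emb_std = model.calibrate(sample, length)   # embedding mean and std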
- forward(inputs, length)[source]
Encodes the input audio as tokens
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths
- Returns:
tokens – A (Batch x Tokens) tensor of audio tokens
- Return type:
torch.Tensor
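Because forward performs the tokenization, the module can also be called directly; a brief sketch with an illustrative checkpoint and save path:

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.encodec import Encodec
>>> model = Encodec("facebook/encodec_24khz", save_path="savedir")
>>> audio = torch.randn(2, 24000)   # (Batch x Samples) audio
>>> length = torch.ones(2)          # relative lengths
>>> tokens = model(audio, length)   # invokes forward(); returns the audio tokens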
- encode(inputs, length)[source]
Encodes the input audio as tokens and embeddings
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths
- Returns:
tokens (torch.Tensor) – A (Batch x Tokens x Heads) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers
- embeddings(tokens)[source]
Converts token indexes to vector embeddings
- Parameters:
tokens (torch.Tensor) – A (Batch x Length x Heads) tensor of token indexes
- Returns:
emb – A (Batch x Length x Heads x Embedding) tensor of raw vector embeddings from the model's quantizer codebooks
- Return type:
torch.Tensor
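For instance, token indices produced by encode can be mapped back to their codebook vectors; a minimal sketch with the same illustrative checkpoint and the default (non-flat) embedding layout:

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.encodec import Encodec
>>> model = Encodec("facebook/encodec_24khz", save_path="savedir")
>>> audio = torch.randn(4, 1000)
>>> length = torch.ones(4)
>>> tokens, _ = model.encode(audio, length)   # (Batch x Length x Heads) token indices
>>> emb = model.embeddings(tokens)            # (Batch x Length x Heads x Embedding) vectors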
- decode(tokens, length=None)[source]
Decodes audio from tokens
- Parameters:
tokens (torch.Tensor) – A (Batch x Length x Heads) tensor of audio tokens
length (torch.Tensor) – A 1-D tensor of relative lengths
- Returns:
audio – The reconstructed audio
- Return type:
torch.Tensor
- tokens(emb, length=None)[source]
Converts embeddings to raw tokens
- Parameters:
emb (torch.Tensor) – Raw embeddings
length (torch.Tensor) – A 1-D tensor of relative lengths. If supplied, padded positions will be zeroed out
- Returns:
tokens – A (Batch x Length x Heads) tensor of token indices
- Return type:
torch.Tensor
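A brief sketch of re-quantizing embeddings into token indices, using a padded batch so that the optional length argument zeroes out padded positions (illustrative checkpoint as above):

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.encodec import Encodec
>>> model = Encodec("facebook/encodec_24khz", save_path="savedir")
>>> audio = torch.randn(4, 1000)
>>> length = torch.tensor([1.0, 0.5, 0.75, 1.0])   # relative lengths; shorter items are padded
>>> _, emb = model.encode(audio, length)
>>> rec_tokens = model.tokens(emb, length)         # token indices; padded positions zeroed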
- decode_emb(emb, length)[source]
Decodes raw vector embeddings into audio
- Parameters:
emb (torch.Tensor) – A (Batch x Length x Heads x Embedding) tensor of raw vector embeddings
length (torch.Tensor) – The corresponding lengths of the inputs.
- Returns:
audio – The reconstructed audio
- Return type:
torch.Tensor
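As a closing sketch, decode_emb is the embedding-space counterpart of decode: the embeddings returned by encode can be rendered back to audio directly (illustrative checkpoint as above):

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.encodec import Encodec
>>> model = Encodec("facebook/encodec_24khz", save_path="savedir")
>>> audio = torch.randn(4, 1000)
>>> length = torch.ones(4)
>>> tokens, emb = model.encode(audio, length)
>>> rec = model.decode_emb(emb, length)   # reconstructed audio from raw embeddings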