speechbrain.integrations.huggingface.mimi module
This lobe enables the integration of the HuggingFace pretrained Mimi model.
The Mimi codec is a state-of-the-art neural audio codec developed by Kyutai. It combines semantic and acoustic information into audio tokens at a frame rate of 12.5 Hz and a bitrate of 1.1 kbps.
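As a rough check on these figures, the bitrate follows from the frame rate, the number of codebooks, and the bits carried by each code. The sketch below assumes a frame rate of 12.5 Hz (the value consistent with the class example, where 2 s of audio yields 25 frames), 8 codebooks, and 2048 entries per codebook. The codebook size is not stated on this page and is an assumption taken from the Moshi paper.

```python
import math


def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second for a multi-codebook codec (hypothetical helper)."""
    # Each code indexes one entry of its codebook, so it carries log2(size) bits.
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# Assumed values: 12.5 frames/s, 8 codebooks, 2048 entries per codebook.
bitrate = codec_bitrate_bps(12.5, 8, 2048)
print(bitrate)  # 1100.0 bits/s, i.e. 1.1 kbps
```

Under the same assumptions, using fewer codebooks lowers the bitrate proportionally.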
Note that you need to install transformers>=4.45.1 to use this module.
Repository: https://huggingface.co/kyutai/mimi
Paper: https://kyutai.org/Moshi.pdf
- Authors
Pooneh Mousavi 2024
Summary
Classes:
Mimi: This lobe enables the integration of the HuggingFace pretrained Mimi model.
Reference
- class speechbrain.integrations.huggingface.mimi.Mimi(source, save_path, sample_rate=24000, freeze=True, num_codebooks=8)
Bases: HFTransformersInterface
This lobe enables the integration of the HuggingFace pretrained Mimi model. The Mimi codec is a state-of-the-art neural audio codec developed by Kyutai. It combines semantic and acoustic information into audio tokens at a frame rate of 12.5 Hz and a bitrate of 1.1 kbps.
- Source paper: https://kyutai.org/Moshi.pdf
- Transformers>=4.45.1 from HuggingFace needs to be installed.
- The code is adapted from the official HF Kyutai repository: https://huggingface.co/kyutai/mimi
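The version requirement can be checked before constructing the model. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the parser only looks at the leading numeric release components, so pre-release suffixes such as `.dev0` are ignored.

```python
from importlib import metadata

# Minimum version stated in this module's documentation.
MIN_TRANSFORMERS = (4, 45, 1)


def parse_release(version):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(digits))
    return tuple(parts)


def transformers_is_recent_enough():
    """True if transformers is installed and at least MIN_TRANSFORMERS."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return parse_release(installed) >= MIN_TRANSFORMERS


if not transformers_is_recent_enough():
    print("Install or upgrade with: pip install 'transformers>=4.45.1'")
```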
- Parameters:
source (str) – A HuggingFace repository identifier or a path
save_path (str) – The location where the pretrained model will be saved
sample_rate (int (default: 24000)) – The audio sampling rate
freeze (bool) – Whether the model will be frozen (i.e. not trainable if used as part of training another model)
num_codebooks (int (default: 8)) – Number of codebooks to use. Must be one of [2, 3, 4, 5, 6, 7, 8]
Example
>>> import torch
>>> from speechbrain.integrations.huggingface.mimi import Mimi
>>> model_hub = "kyutai/mimi"
>>> save_path = "savedir"
>>> model = Mimi(model_hub, save_path)
>>> audio = torch.randn(4, 48000)
>>> length = torch.tensor([1.0, 0.5, 0.75, 1.0])
>>> tokens, emb = model.encode(audio, length)
>>> tokens.shape
torch.Size([4, 8, 25])
>>> emb.shape
torch.Size([4, 8, 25, 256])
>>> rec = model.decode(tokens, length)
>>> rec.shape
torch.Size([4, 1, 48000])
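The shapes in the example can be checked by hand: 48000 samples at 24 kHz is 2 s of audio, and 25 frames over 2 s corresponds to 12.5 frames per second, or 1920 samples per frame. A minimal sketch of that arithmetic follows. The helper name is hypothetical, and rounding up for partial frames is an assumption that the example (an exact multiple of the frame size) does not exercise.

```python
import math

SAMPLE_RATE = 24000  # default sample_rate of the Mimi class
FRAME_RATE = 12.5    # frames per second implied by the example shapes
SAMPLES_PER_FRAME = int(SAMPLE_RATE / FRAME_RATE)  # 1920


def expected_num_frames(num_samples):
    """Number of token frames for a waveform (assumes partial frames round up)."""
    return math.ceil(num_samples / SAMPLES_PER_FRAME)


print(expected_num_frames(48000))  # 25, matching tokens.shape[-1] in the example
```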
- forward(inputs, length)
Encodes the input audio as tokens and embeddings, then decodes audio from the tokens.
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths
- Returns:
tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers
audio (torch.Tensor) – The reconstructed audio
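The `length` argument holds relative lengths: each entry is the fraction of the padded batch length occupied by real audio, following the usual SpeechBrain convention. A minimal sketch of deriving these values from absolute sample counts, in plain Python (the helper name is hypothetical):

```python
def relative_lengths(num_samples):
    """Convert absolute lengths (in samples) to fractions of the longest item."""
    longest = max(num_samples)
    return [n / longest for n in num_samples]


# Four utterances padded to 48000 samples, as in the class example.
print(relative_lengths([48000, 24000, 36000, 48000]))  # [1.0, 0.5, 0.75, 1.0]
```

The resulting list can be wrapped in `torch.tensor(...)` before it is passed as `length`.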
- encode(inputs, length)
Encodes the input audio as tokens and embeddings.
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A tensor of relative lengths
- Returns:
tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers
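When working with the returned tokens downstream, it can help to know which frames correspond to real audio rather than padding. How the module applies `length` internally is not documented here; the sketch below is only an illustration that assumes the number of valid frames is the relative length times the frame count, rounded up. The helper name is hypothetical.

```python
import math


def frame_mask(rel_lengths, num_frames):
    """Boolean mask per batch item: True for frames assumed to hold real audio."""
    mask = []
    for rel in rel_lengths:
        # Assumption: partial frames count as valid, capped at num_frames.
        valid = min(num_frames, math.ceil(rel * num_frames))
        mask.append([i < valid for i in range(num_frames)])
    return mask


mask = frame_mask([1.0, 0.5, 0.75, 1.0], 25)
print([sum(row) for row in mask])  # [25, 13, 19, 25]
```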
- decode(tokens, length=None)
Decodes audio from tokens.
- Parameters:
tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
length (torch.Tensor) – A 1-D tensor of relative lengths
- Returns:
audio – The reconstructed audio
- Return type: