speechbrain.integrations.huggingface.mimi module

This lobe integrates the pretrained Mimi model from HuggingFace.

The Mimi codec is a state-of-the-art neural audio codec developed by Kyutai. It combines semantic and acoustic information into audio tokens running at 12.5 Hz, at a bitrate of 1.1 kbps.
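The quoted bitrate can be checked with a little arithmetic. The sketch below assumes a frame rate of 12.5 Hz (consistent with the class example on this page, where 2 seconds of audio yield 25 frames), 8 codebooks, and 2048 entries per codebook, i.e. 11 bits per token. The codebook size comes from the Moshi paper rather than from this module, and the helper name `mimi_bitrate_bps` is hypothetical.

```python
import math


def mimi_bitrate_bps(num_codebooks=8, frame_rate_hz=12.5, codebook_size=2048):
    """Approximate bitrate in bits per second (illustrative helper, not part of SpeechBrain)."""
    # Each token indexes one codebook entry, so it carries log2(codebook_size) bits.
    bits_per_token = math.log2(codebook_size)
    # Every frame emits one token per codebook.
    return frame_rate_hz * num_codebooks * bits_per_token


print(mimi_bitrate_bps())                  # 1100.0 bits/s, i.e. 1.1 kbps
print(mimi_bitrate_bps(num_codebooks=2))   # 275.0 bits/s
```

Under these assumptions, lowering `num_codebooks` reduces the bitrate proportionally.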

Note that you need to install transformers>=4.45.1 to use this module.

Repository: https://huggingface.co/kyutai/mimi

Paper: https://kyutai.org/Moshi.pdf

Authors

Pooneh Mousavi 2024

Summary

Classes:

Mimi

This lobe enables the integration of HuggingFace pretrained Mimi model.

Reference

class speechbrain.integrations.huggingface.mimi.Mimi(source, save_path, sample_rate=24000, freeze=True, num_codebooks=8)[source]

Bases: HFTransformersInterface

This lobe enables the integration of the HuggingFace pretrained Mimi model. The Mimi codec is a state-of-the-art neural audio codec developed by Kyutai. It combines semantic and acoustic information into audio tokens running at 12.5 Hz, at a bitrate of 1.1 kbps.

Source paper: https://kyutai.org/Moshi.pdf
Transformers>=4.45.1 from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html
The code is adapted from the official HF Kyutai repository: https://huggingface.co/kyutai/mimi

Parameters:

source (str) – A HuggingFace repository identifier or a path
save_path (str) – The location where the pretrained model will be saved
sample_rate (int (default: 24000)) – The audio sampling rate
freeze (bool (default: True)) – Whether the model is frozen (i.e. not trainable when used as part of training another model)
num_codebooks (int (default: 8)) – Number of codebooks. Supported values are 2, 3, 4, 5, 6, 7, and 8

Example

>>> model_hub = "kyutai/mimi"
>>> save_path = "savedir"
>>> model = Mimi(model_hub, save_path)
>>> audio = torch.randn(4, 48000)
>>> length = torch.tensor([1.0, 0.5, 0.75, 1.0])
>>> tokens, emb = model.encode(audio, length)
>>> tokens.shape
torch.Size([4, 8, 25])
>>> emb.shape
torch.Size([4, 8, 25, 256])
>>> rec = model.decode(tokens, length)
>>> rec.shape
torch.Size([4, 1, 48000])
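The `length` tensor in the example holds relative lengths, following the usual SpeechBrain convention in which each item's length is expressed as a fraction of the longest item in the padded batch. A minimal sketch of how such values could be derived from sample counts is shown below; the helper name `relative_lengths` is hypothetical and is not part of this module.

```python
def relative_lengths(num_samples):
    """Convert absolute sample counts to fractions of the longest item (illustrative helper)."""
    longest = max(num_samples)
    return [n / longest for n in num_samples]


# Sample counts for four utterances padded to 48000 samples.
print(relative_lengths([48000, 24000, 36000, 48000]))  # [1.0, 0.5, 0.75, 1.0]
```

The resulting list would be wrapped in `torch.tensor(...)` before being passed to `encode` or `decode`.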

forward(inputs, length)[source]

Encodes the input audio as tokens and embeddings, then decodes audio from the tokens

Parameters:

inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A 1-D tensor of relative lengths

Returns:

tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model’s quantizers
audio (torch.Tensor) – The reconstructed audio

encode(inputs, length)[source]

Encodes the input audio as tokens and embeddings

Parameters:

inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
length (torch.Tensor) – A 1-D tensor of relative lengths

Returns:

tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model’s quantizers
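The Length dimension of `tokens` follows from the input duration and the codec frame rate. The sketch below estimates it, assuming a 24000 Hz sample rate and a 12.5 Hz frame rate (1920 samples per frame). The helper name `expected_num_frames` is hypothetical, and rounding up for partial frames is an assumption rather than documented behaviour.

```python
import math


def expected_num_frames(num_samples, sample_rate=24000, frame_rate_hz=12.5):
    """Estimate the token sequence length for an input (illustrative helper)."""
    # Number of audio samples covered by a single token frame.
    samples_per_frame = sample_rate / frame_rate_hz
    # Assumption: a trailing partial frame still produces one token step.
    return math.ceil(num_samples / samples_per_frame)


print(expected_num_frames(48000))  # 25, matching the class example's token shape
```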

decode(tokens, length=None)[source]

Decodes audio from tokens

Parameters:

tokens (torch.Tensor) – A (Batch x num_codebooks x Length) tensor of audio tokens
length (torch.Tensor (default: None)) – A 1-D tensor of relative lengths

Returns:

audio – the reconstructed audio

Return type:

torch.Tensor