speechbrain.lobes.models.discrete.wavtokenizer module
This lobe enables the integration of a pretrained WavTokenizer model.
Note that you need to pip install git+https://github.com/Tomiinek/WavTokenizer
to use this module.
Repository: https://github.com/jishengpeng/WavTokenizer/
Paper: https://arxiv.org/abs/2408.16532
- Authors
Pooneh Mousavi 2024
Summary
Classes:
WavTokenizer | This lobe enables the integration of the pretrained WavTokenizer model, a discrete codec model with a single codebook for Audio Language Modeling.
Reference
- class speechbrain.lobes.models.discrete.wavtokenizer.WavTokenizer(source, save_path=None, config='wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml', checkpoint='WavTokenizer_small_600_24k_4096.ckpt', sample_rate=24000, freeze=True)
Bases: Module
This lobe enables the integration of the pretrained WavTokenizer model, a discrete codec model with a single codebook for Audio Language Modeling.
- Source paper: https://arxiv.org/abs/2408.16532
You need to pip install git+https://github.com/Tomiinek/WavTokenizer to use this module. The code is adapted from the official WavTokenizer repository: https://github.com/jishengpeng/WavTokenizer/
- Parameters:
source (str) – A HuggingFace repository identifier or a path
save_path (str) – The location where the pretrained model will be saved
config (str) – The name of the HF config file.
checkpoint (str) – The name of the HF checkpoint file.
sample_rate (int (default: 24000)) – The audio sampling rate
freeze (bool) – Whether the model will be frozen (e.g. not trainable if used as part of training another model)
Example
>>> import torch
>>> model_hub = "novateur/WavTokenizer"
>>> save_path = "savedir"
>>> config = "wavtokenizer_smalldata_frame40_3s_nq1_code4096_dim512_kmeans200_attn.yaml"
>>> checkpoint = "WavTokenizer_small_600_24k_4096.ckpt"
>>> model = WavTokenizer(model_hub, save_path, config=config, checkpoint=checkpoint)
>>> audio = torch.randn(4, 48000)
>>> length = torch.tensor([1.0, .5, .75, 1.0])
>>> tokens, embs = model.encode(audio)
>>> tokens.shape
torch.Size([4, 1, 80])
>>> embs.shape
torch.Size([4, 80, 512])
>>> rec = model.decode(tokens)
>>> rec.shape
torch.Size([4, 48000])
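WavTokenizer expects input audio at the sample_rate given at construction (24000 Hz by default). Below is a minimal preprocessing sketch, assuming torchaudio is available; the file path and the mono mixdown step are illustrative placeholders, not part of this module's API.

import torch
import torchaudio

# Load an audio file (placeholder path) and resample it to the
# tokenizer's expected rate before calling encode().
waveform, orig_rate = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono: (1 x Samples)
waveform = torchaudio.functional.resample(waveform, orig_freq=orig_rate, new_freq=24000)
tokens, embs = model.encode(waveform)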
- forward(inputs)
Encodes the input audio as tokens and embeddings and decodes audio from tokens
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) tensor of audio
- Returns:
tokens (torch.Tensor) – A (Batch x Tokens x Heads) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers
audio (torch.Tensor) – The reconstructed audio
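Calling the module directly runs the full tokenize-and-reconstruct round trip. A minimal usage sketch, assuming a model constructed as in the class example above; the return order follows the documented Returns list:

import torch

audio = torch.randn(4, 48000)  # (Batch x Samples), 2 s at 24 kHz
# forward() returns tokens, embeddings, and reconstructed audio, in that order.
tokens, emb, rec = model(audio)
print(tokens.shape, emb.shape, rec.shape)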
- encode(inputs)
Encodes the input audio as tokens and embeddings
- Parameters:
inputs (torch.Tensor) – A (Batch x Samples) or (Batch x Channel x Samples) tensor of audio
- Returns:
tokens (torch.Tensor) – A (Batch x NQ x Length) tensor of audio tokens
emb (torch.Tensor) – Raw vector embeddings from the model's quantizers
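Because this configuration uses a single codebook (nq1 in the config name, so NQ = 1), the token tensor can be squeezed into a flat (Batch x Length) sequence for audio language modeling. A minimal sketch, assuming the model from the class example above; the codebook size of 4096 is inferred from the config name:

import torch

audio = torch.randn(4, 48000)
tokens, embs = model.encode(audio)          # tokens: (Batch x NQ x Length), NQ == 1 here
lm_inputs = tokens.squeeze(1)               # (Batch x Length) integer codes in [0, 4096)
rec = model.decode(lm_inputs.unsqueeze(1))  # restore the NQ axis before decoding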