speechbrain.integrations.audio_tokenizers.speechtokenizer_interface module

This lobe enables the integration of pretrained SpeechTokenizer models.

Please install speechtokenizer: pip install speechtokenizer

Reference: https://arxiv.org/abs/2308.16692

The Transformers library from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html

Author

Pooneh Mousavi 2023

Summary

Classes:

SpeechTokenizer

This lobe enables the integration of HuggingFace and SpeechBrain pretrained SpeechTokenizer models.

Reference

class speechbrain.integrations.audio_tokenizers.speechtokenizer_interface.SpeechTokenizer(source, save_path, sample_rate=16000)[source]

Bases: Module

This lobe enables the integration of HuggingFace and SpeechBrain pretrained SpeechTokenizer models.

Please install speechtokenizer: pip install speechtokenizer

Source paper: https://arxiv.org/abs/2308.16692

The model can be used as a fixed discrete feature extractor or can be fine-tuned. It automatically downloads the model from HuggingFace or loads it from a local path.

Parameters:

source (str) – HuggingFace hub name, e.g. “fnlp/SpeechTokenizer”.
save_path (str) – Directory where the downloaded model is saved.
sample_rate (int (default: 16000)) – The audio sampling rate in Hz.

Example

>>> import torch
>>> inputs = torch.rand([10, 600])
>>> model_hub = "fnlp/SpeechTokenizer"
>>> save_path = "savedir"
>>> model = SpeechTokenizer(model_hub, save_path)
>>> tokens = model.encode(inputs)
>>> tokens.shape
torch.Size([8, 10, 2])
>>> wav = model.decode(tokens)
>>> wav.shape
torch.Size([10, 640])
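
The shapes in the example above are consistent with one token frame per 320 input samples: 600 samples yield 2 frames, and decoding 2 frames yields 640 samples. The sketch below reproduces that arithmetic. The hop size of 320 and both helper names are assumptions inferred from the example output, not values read from the library.

```python
import math

# Assumed hop size (samples per token frame), inferred from the example:
# 600 input samples -> 2 frames -> 640 decoded samples.
ASSUMED_HOP = 320


def expected_num_frames(num_samples, hop=ASSUMED_HOP):
    """Frames produced if the input is padded up to a multiple of `hop`."""
    return math.ceil(num_samples / hop)


def expected_decoded_length(num_frames, hop=ASSUMED_HOP):
    """Samples reconstructed if each frame expands back to `hop` samples."""
    return num_frames * hop


frames = expected_num_frames(600)
print(frames)                           # 2
print(expected_decoded_length(frames))  # 640
```

Under this assumption the decoded waveform can be slightly longer than the input, so trimming it back to the original number of samples may be needed before comparing the two.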

forward(wav, wav_lens=None)[source]

Takes an input waveform and returns its corresponding discrete audio tokens.

Parameters:

wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.
wav_lens (torch.Tensor) – The relative length of the wav given in SpeechBrain format.

Returns:

tokens – A (N_q, Batch, Seq) tensor of audio tokens.

Return type:

torch.Tensor
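
In SpeechBrain, relative lengths are expressed as fractions of the longest signal in the padded batch, so the longest item has length 1.0. A minimal sketch of that convention follows, using plain Python lists in place of tensors; the helper names are hypothetical and are not part of this module.

```python
def relative_lengths(num_samples):
    """Convert absolute lengths (in samples) to fractions of the longest item."""
    longest = max(num_samples)
    return [n / longest for n in num_samples]


def absolute_lengths(rel_lens, padded_len):
    """Convert relative lengths back to sample counts for a padded batch."""
    return [round(r * padded_len) for r in rel_lens]


rel = relative_lengths([16000, 8000, 12000])
print(rel)                           # [1.0, 0.5, 0.75]
print(absolute_lengths(rel, 16000))  # [16000, 8000, 12000]
```

In practice the relative lengths would be wrapped in a torch.Tensor before being passed as wav_lens.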

encode(wav, wav_lens=None)[source]

Takes an input waveform and returns its corresponding discrete audio tokens.

Parameters:

wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.
wav_lens (torch.Tensor) – The relative length of the wav given in SpeechBrain format.

Returns:

tokens – A (N_q, Batch, Seq) tensor of audio tokens.

Return type:

torch.Tensor
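
The SpeechTokenizer paper describes the first residual quantizer layer as carrying mostly content information, with the later layers adding acoustic detail. Assuming the (N_q, Batch, Seq) layout suggested by the class example (torch.Size([8, 10, 2])), indexing the first axis separates the layers. The sketch below uses nested lists with dummy values as a stand-in for the returned tensor; the variable names are illustrative only.

```python
N_Q, BATCH, SEQ = 8, 2, 3

# Dummy tokens: each value encodes (layer, batch item, frame) so that
# slices are easy to verify by eye. Real tokens are codebook indices.
tokens = [
    [[q * 100 + b * 10 + t for t in range(SEQ)] for b in range(BATCH)]
    for q in range(N_Q)
]

first_layer = tokens[0]        # layout (Batch, Seq)
remaining_layers = tokens[1:]  # layout (N_q - 1, Batch, Seq)

# Regroup so that each utterance owns its full stack of layers:
# layout (Batch, N_q, Seq).
per_utterance = [[tokens[q][b] for q in range(N_Q)] for b in range(BATCH)]

print(first_layer)            # [[0, 1, 2], [10, 11, 12]]
print(len(remaining_layers))  # 7
print(per_utterance[1][2])    # [210, 211, 212]
```

On a real torch.Tensor the same selections are tokens[0], tokens[1:], and tokens.permute(1, 0, 2).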

decode(codes)[source]

Takes a tensor of audio tokens and returns the corresponding reconstructed waveform.

Parameters: codes (torch.Tensor) – A (N_q, Batch, Seq) tensor of audio tokens.
Returns: wav – A batch of reconstructed audio signals.
Return type: torch.Tensor (signal)