speechbrain.lobes.models.huggingface_transformers.whisper module

This lobe enables the integration of the HuggingFace pretrained Whisper model.

The HuggingFace Transformers library needs to be installed: https://huggingface.co/transformers/installation.html

Authors
  • Adel Moumen 2022, 2024

  • Titouan Parcollet 2022

  • Luca Della Libera 2022

  • Ha Nguyen 2023

Summary

Classes:

Whisper – This lobe enables the integration of the HuggingFace pretrained Whisper model.

Reference

class speechbrain.lobes.models.huggingface_transformers.whisper.Whisper(source, save_path, sampling_rate=16000, encoder_only=False, freeze=False, freeze_encoder=False, output_attentions=False, output_all_hiddens=False, language=None, task='transcribe')[source]

Bases: HFTransformersInterface

This lobe enables the integration of the HuggingFace pretrained Whisper model.

Source paper: https://cdn.openai.com/papers/whisper.pdf

The HuggingFace Transformers library needs to be installed: https://huggingface.co/transformers/installation.html

Some parts of the code are also adapted from the official OpenAI repository: https://github.com/openai/whisper

The model can be fine-tuned. It will automatically download the model from HuggingFace or use a local path.

Parameters:
  • source (str) – HuggingFace hub name, e.g. "openai/whisper-tiny".

  • save_path (str) – Path (dir) of the downloaded model.

  • sampling_rate (int (default: 16000)) – Sampling rate of the audio signal.

  • encoder_only (bool (default: False)) – If True, the forward function outputs the hidden states from the last transformer layer of the encoder. If False, one step of the decoder is performed and returned.

  • freeze (bool (default: False)) – If True, the model is frozen.

  • freeze_encoder (bool (default: False)) – If True, the encoder is frozen.

  • output_attentions (bool (default: False)) – If True, the forward function outputs the attention weights. By default it is False, because flash attention requires output_attentions=False. When output_attentions is True, a from-scratch attention implementation is used instead, which can slow down the code and increase VRAM usage.

  • output_all_hiddens (bool (default: False)) – If True, the forward function outputs the hidden states from all transformer layers of the encoder. For example, whisper-base has 6 transformer layers and the output is of shape (7, B, T, C), where the CNN output is prepended to the beginning. If False, the forward function outputs the hidden states only from the last transformer layer of the encoder.

  • language (str (default: None)) – Language token to use for the decoder.

  • task (str (default: "transcribe")) – Task token to use for the decoder. It must be either "transcribe" or "translate".

Example

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.whisper import Whisper
>>> model_hub = "openai/whisper-tiny"
>>> save_path = "savedir"
>>> sampling_rate = 16000
>>> model = Whisper(model_hub, save_path, sampling_rate)
>>> tokens = torch.tensor([[1, 1]]) * model.model.config.decoder_start_token_id
>>> inputs = torch.randn([1, 93680])
>>> outputs = model(inputs, tokens)
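When only features are needed, the encoder can be used on its own. A minimal sketch reusing the variables above (the stacked shape follows the output_all_hiddens description; the exact layer count depends on the checkpoint):

>>> encoder = Whisper(model_hub, save_path, sampling_rate, encoder_only=True, output_all_hiddens=True)
>>> feats = encoder(inputs)  # (n_layers + 1, B, T, C); the CNN output comes first
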
freeze_model(model)[source]

Freezes parameters of a model.

Parameters:

model (from AutoModel.from_config) – Valid HuggingFace transformers model object.
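For example, to fine-tune the encoder while keeping the decoder fixed, one could freeze only the decoder submodule. A hedged sketch (get_decoder() is the generic HuggingFace accessor, not part of this lobe):

>>> model.freeze_model(model.model.get_decoder())  # decoder parameters no longer receive gradients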

forward(wav, decoder_input_ids=None)[source]

Perform the Mel transformation and one step of Whisper (encoder-decoder).

Parameters:
  • wav (torch.Tensor) – A batch of audio signals to transform to features.

  • decoder_input_ids (torch.Tensor) – Input tokens for the decoder. This can be language, task, etc. Please refer to the Whisper paper for more details, or see the seq2seq.py file in SpeechBrain for how to generate the tokens with Greedy Search and/or Beam Search.

Returns:

  • out_encoder (torch.Tensor) – The output of the encoder model.

  • decoder_logits (torch.Tensor) – The output of the decoder model.

  • decoder_attn (torch.Tensor) – The attention values of the decoder model.
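A minimal sketch of one full step, reusing model, inputs, and tokens from the class example (by analogy with forward_decoder, decoder_attn is presumably None unless output_attentions=True):

>>> out_encoder, decoder_logits, decoder_attn = model(inputs, tokens)
>>> next_token = decoder_logits[:, -1].argmax(dim=-1)  # greedy pick for the next decoding step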

log_mel_spectrogram(audio, padding: int = 0)[source]

Compute the Mel spectrogram of a batch of input waveforms.

Reference: adapted from https://github.com/openai/whisper/blob/eff383b27b783e280c089475852ba83f20f64998/whisper/audio.py#L92

Parameters:
  • audio (torch.Tensor) – A batch of audio waveforms sampled at 16 kHz.

  • padding (int) – The number of samples to append to the end of the audio tensor.

Returns:

log_spec – A tensor that contains the batch of Mel spectrograms.

Return type:

torch.Tensor
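A minimal sketch (the number of Mel bins depends on the checkpoint, e.g. 80 for whisper-tiny; the dimension order shown is an assumption):

>>> audio = torch.randn([1, 16000])  # one second of 16 kHz audio
>>> log_spec = model.log_mel_spectrogram(audio)  # (batch, n_mels, frames)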

pad_or_trim(array, length: int = 480000, axis=-1)[source]

Pad or trim the Mel spectrograms as expected by the encoder.

Reference: adapted from https://github.com/openai/whisper/blob/eff383b27b783e280c089475852ba83f20f64998/whisper/audio.py#L52

Parameters:
  • array (torch.Tensor) – A tensor that contains the batch of Mel spectrograms.

  • length (int) – The input tensor will be coerced to this number of samples.

  • axis (int) – The axis along which to pad.

Returns:

array – The padded tensor.

Return type:

torch.Tensor
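The default length of 480000 corresponds to Whisper's fixed 30-second input at 16 kHz. A minimal sketch on a raw waveform (applying it to waveforms before the Mel transform is an assumption here):

>>> short = torch.randn([1, 16000])
>>> fixed = model.pad_or_trim(short)  # zero-padded along the last axis up to 480000 elements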

forward_encoder(mel)[source]

Takes an input Mel spectrogram and returns its corresponding encoder states. Returns the last hidden state of the encoder, or all hidden states if output_all_hiddens is True.

Parameters:

mel (torch.Tensor) – A batch of Mel spectrograms to transform into features.

Returns:

The last hidden state of the encoder or all hidden states if output_all_hiddens is True.

Return type:

torch.Tensor
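A hedged sketch chaining the front end and the encoder, reusing inputs from the class example:

>>> mel = model.log_mel_spectrogram(model.pad_or_trim(inputs))
>>> encoder_states = model.forward_encoder(mel)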

forward_decoder(encoder_states, decoder_input_ids, use_cache=True, past_key_values=None)[source]

Perform one step of the Whisper decoder.

Parameters:
  • encoder_states (torch.Tensor) – A batch of encoder states (Mel features passed through the Whisper encoder).

  • decoder_input_ids (torch.Tensor) – Input tokens for the decoder. This can be language, task, etc. Please refer to the Whisper paper for more details, or see the seq2seq.py file in SpeechBrain for how to generate the tokens with Greedy Search and/or Beam Search.

  • use_cache (bool) – If True, keys and values are returned as output for KV caching.

  • past_key_values (torch.Tensor (default: None)) – If not None, the past key values are used for KV caching to avoid recomputing the attention weights.

Returns:

  • logits (torch.Tensor) – The logits of the decoder.

  • attn (torch.Tensor | None) – If output_attentions is True, the attention weights are returned. Otherwise, None is returned.

  • past_key_values (torch.Tensor) – The past key values of the decoder.
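A hedged sketch of two greedy steps with KV caching, reusing encoder_states from the sketch above. Feeding only the newest token on cached steps follows the usual HuggingFace convention and is an assumption here; a real system would use the search classes in seq2seq.py:

>>> ids = torch.tensor([[model.model.config.decoder_start_token_id]])
>>> logits, attn, past_kv = model.forward_decoder(encoder_states, ids)
>>> next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
>>> logits, attn, past_kv = model.forward_decoder(encoder_states, next_id, past_key_values=past_kv)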

property all_language_tokens

Returns the list of token ids for all language tokens.

property all_language_codes

Returns the list of language codes corresponding to the language tokens.

property non_speech_tokens

Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech annotations and to prevent sampling texts that are not actually spoken in the audio, e.g.

  • β™ͺβ™ͺβ™ͺ

  • ( SPEAKING FOREIGN LANGUAGE )

  • [DAVID] Hey there,

while keeping basic punctuation marks like commas, periods, question marks, exclamation points, etc.

Taken from: openai/whisper GitHub

property transcribe: int

Returns the token id corresponding to the value of the transcribe field.

property translate: int

Returns the token id corresponding to the value of the translate field.

property bos: int

Returns the token id corresponding to the value of the bos field.

property eos: int

Returns the token id corresponding to the value of the eos field.

property bos_lm: int

Returns the token id corresponding to the value of the bos_lm field.

property bos_prev: int

Returns the token id corresponding to the value of the bos_prev field.

property no_timestamps: int

Returns the token id corresponding to the value of the no_timestamps field.

property timestamp_begin: int

Returns the token id corresponding to the value of the timestamp_begin field.

property no_speech: int

Returns the token id corresponding to the value of the no_speech field.

property language_token: int

Returns the token id corresponding to the value of the language field.
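Together, these ids can assemble the standard Whisper decoder prompt (start-of-transcript, language, task, no-timestamps). A hedged sketch for a multilingual checkpoint, assuming bos maps to the start-of-transcript token:

>>> prompt = torch.tensor([[model.bos, model.language_token, model.transcribe, model.no_timestamps]])
>>> _, logits, attn = model(inputs, prompt)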

to_language_token(language)[source]

Returns the token id corresponding to the given language.

Parameters:

language (str) – The language to convert to a token.

Returns:

The token id corresponding to the given language.

Return type:

token

Raises:

KeyError – If the language is not found in the tokenizer.
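A minimal sketch (the code "fr" is illustrative; valid values are those listed by all_language_codes):

>>> fr_token = model.to_language_token("fr")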

set_language_token(language)[source]

Set the language token to the given language.

Parameters:

language (str) – The language to set the token to.

set_task(task)[source]

Set the task token to the given task.

Parameters:

task (str) – The task to set the token to.
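A hedged sketch configuring a multilingual model for French-to-English speech translation, assuming the same language-code format as to_language_token:

>>> model.set_language_token("fr")
>>> model.set_task("translate")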

property is_multilingual

Returns True if the model is multilingual, False otherwise.

property get_suppress_tokens

Returns the list of tokens to suppress.

detect_language(mel)[source]

Detect the language of the given mel spectrogram features.

Parameters:

mel (torch.Tensor) – Mel spectrogram features to detect the language of.

Returns:

  • language_tokens (torch.Tensor of shape (batch_size,)) – ids of the most probable language tokens, which appear after the startoftranscript token.

  • language_probs (List[Dict[str, float]]) – list of dictionaries containing the probability distribution over all languages.

Raises:

ValueError – If the model doesn’t have language tokens.
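A minimal sketch, guarded for multilingual checkpoints (English-only models raise ValueError), reusing inputs from the class example:

>>> if model.is_multilingual:
...     mel = model.log_mel_spectrogram(model.pad_or_trim(inputs))
...     language_tokens, language_probs = model.detect_language(mel)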