speechbrain.integrations.huggingface.w2v_bert module

This lobe enables the integration of HuggingFace pretrained w2v-bert-2.0 models.

Reference: https://arxiv.org/abs/2312.05187

The Transformers library from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html

Authors
  • Maryem Bouziane 2025

  • Salima Mdhaffar 2025

  • Yannick Estève 2025

Summary

Classes:

W2VBert

This lobe enables the integration of HuggingFace and SpeechBrain pretrained w2v-bert-2.0 models.

Reference

class speechbrain.integrations.huggingface.w2v_bert.W2VBert(source: str, save_path: str, output_norm: bool = False, freeze: bool = True, freeze_feature_extractor: bool = False, apply_spec_augment: bool = False, output_all_hiddens: bool = False, sample_rate: int | None = None, **kwargs)

Bases: HFTransformersInterface

This lobe enables the integration of HuggingFace and SpeechBrain pretrained w2v-bert-2.0 models.

Source paper (w2v-BERT): https://arxiv.org/abs/2312.05187

The Transformers library from HuggingFace needs to be installed: https://huggingface.co/transformers/installation.html

The model can be used as a fixed feature extractor or can be fine-tuned. It is downloaded automatically from HuggingFace, or loaded from a local path.

Parameters:
  • source (str) – HuggingFace hub name or local path, e.g. “facebook/w2v-bert-2.0”.

  • save_path (str) – Path (dir) used to cache / save the model.

  • output_norm (bool (default: False)) – If True, a layer_norm is applied to the output features.

  • freeze (bool (default: True)) – If True, the model is frozen. If False, the model is trained alongside the rest of the pipeline.

  • freeze_feature_extractor (bool (default: False)) – When freeze is False and this flag is True, only the convolutional feature extractor is frozen.

  • apply_spec_augment (bool (default: False)) – If True, the internal SpecAugment of the HF model is enabled.

  • output_all_hiddens (bool (default: False)) – If True, the forward method outputs the hidden states from all transformer layers.

  • sample_rate (int or None (default: None)) – Expected sampling rate of the input waveforms. If None, the sampling rate is read from the HF feature extractor when available, otherwise it defaults to 16000.

  • **kwargs – Extra keyword arguments passed to the from_pretrained function.
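The fallback order described for sample_rate can be sketched as follows. This is only an illustration of the documented behaviour, not the class's actual code: the helper name resolve_sample_rate and the stand-in extractor object are hypothetical, and the sketch assumes the HF feature extractor exposes its rate as a sampling_rate attribute.

```python
from types import SimpleNamespace

DEFAULT_SAMPLE_RATE = 16000  # fallback value stated in the parameter description


def resolve_sample_rate(sample_rate, feature_extractor=None):
    """Hypothetical helper: explicit rate, else extractor's rate, else 16000."""
    if sample_rate is not None:
        return sample_rate
    # Assumption: the extractor, when present, carries a `sampling_rate` attribute.
    extractor_rate = getattr(feature_extractor, "sampling_rate", None)
    if extractor_rate is not None:
        return extractor_rate
    return DEFAULT_SAMPLE_RATE


# Stand-in object for illustration only (not a real HF feature extractor).
fake_extractor = SimpleNamespace(sampling_rate=8000)

explicit = resolve_sample_rate(22050, fake_extractor)   # explicit value wins
from_extractor = resolve_sample_rate(None, fake_extractor)
fallback = resolve_sample_rate(None, None)               # nothing available
```

Whatever rate is resolved, input waveforms are expected to already be sampled at that rate.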

Example

>>> import torch
>>> from speechbrain.integrations.huggingface.w2v_bert import W2VBert
>>> inputs = torch.rand([2, 16000])
>>> model_hub = "facebook/w2v-bert-2.0"
>>> save_path = "savedir"
>>> model = W2VBert(model_hub, save_path)
>>> outputs = model(inputs)

forward(wav: Tensor, wav_lens: Tensor | None = None) → Tensor

Takes an input waveform and returns its corresponding w2v-BERT encoding.

Parameters:
  • wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor or None) – Relative lengths of the waveforms in the batch, in SpeechBrain format (each value is the fraction of the longest waveform's length).

Returns:

w2v-BERT encoded features.

Return type:

torch.Tensor
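As an illustration of the wav_lens argument, the sketch below zero-pads variable-length waveforms into one batch and derives relative lengths as each waveform's length divided by the longest length in the batch. The helper pad_with_relative_lengths is a hypothetical name introduced here, not part of SpeechBrain; it only shows one possible way to build the inputs expected by forward.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_with_relative_lengths(signals):
    """Hypothetical helper: zero-pad 1-D waveforms and return relative lengths."""
    # Absolute lengths in samples, one per waveform.
    lengths = torch.tensor([s.shape[0] for s in signals], dtype=torch.float32)
    # Shape [batch, max_len]; shorter waveforms are right-padded with zeros.
    batch = pad_sequence(signals, batch_first=True)
    # Relative lengths in (0, 1], where 1.0 marks the longest waveform.
    wav_lens = lengths / lengths.max()
    return batch, wav_lens


signals = [torch.rand(16000), torch.rand(8000), torch.rand(12000)]
wav, wav_lens = pad_with_relative_lengths(signals)
```

The resulting wav and wav_lens tensors could then be passed as model(wav, wav_lens), assuming a W2VBert instance named model as in the class example.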