speechbrain.integrations.huggingface.w2v_bert module
This lobe enables the integration of HuggingFace pretrained w2v-bert-2.0 models.
Reference: https://arxiv.org/abs/2312.05187
The HuggingFace Transformers library needs to be installed: https://huggingface.co/transformers/installation.html
- Authors
Maryem Bouziane 2025
Salima Mdhaffar 2025
Yannick Estève 2025
Summary
Classes:
W2VBert | This lobe enables the integration of HuggingFace and SpeechBrain pretrained w2v-bert-2.0 models.
Reference
- class speechbrain.integrations.huggingface.w2v_bert.W2VBert(source: str, save_path: str, output_norm: bool = False, freeze: bool = True, freeze_feature_extractor: bool = False, apply_spec_augment: bool = False, output_all_hiddens: bool = False, sample_rate: int | None = None, **kwargs)
Bases: HFTransformersInterface
This lobe enables the integration of HuggingFace and SpeechBrain pretrained w2v-bert-2.0 models.
Source paper w2v-BERT: https://arxiv.org/abs/2312.05187
The HuggingFace Transformers library needs to be installed: https://huggingface.co/transformers/installation.html
The model can be used as a fixed feature extractor or can be fine-tuned. It is downloaded automatically from HuggingFace unless source points to a local path.
- Parameters:
source (str) – HuggingFace hub name or local path, e.g. “facebook/w2v-bert-2.0”.
save_path (str) – Path (dir) used to cache / save the model.
output_norm (bool (default: False)) – If True, a layer_norm is applied to the output features.
freeze (bool (default: True)) – If True, the model is frozen. If False, the model is trained alongside the rest of the pipeline.
freeze_feature_extractor (bool (default: False)) – When freeze is False and this flag is True, only the convolutional feature extractor is frozen.
apply_spec_augment (bool (default: False)) – If True, the internal SpecAugment of the HF model is enabled.
output_all_hiddens (bool (default: False)) – If True, the forward method outputs the hidden states from all transformer layers.
sample_rate (int or None (default: None)) – Expected sampling rate of the input waveforms. If None, the sampling rate is read from the HF feature extractor when available, otherwise it defaults to 16000.
**kwargs – Extra keyword arguments passed to the from_pretrained function.
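The fallback order described for sample_rate can be sketched as follows. This is a minimal illustration, not the library's code: resolve_sample_rate is a hypothetical helper, and the sampling_rate attribute on the stand-in feature extractor is an assumption about how the HF object exposes its rate.

```python
from types import SimpleNamespace

DEFAULT_SAMPLE_RATE = 16000  # fallback value stated in the parameter description


def resolve_sample_rate(sample_rate, feature_extractor=None):
    """Hypothetical helper mirroring the documented fallback order."""
    # 1. An explicitly passed value always wins.
    if sample_rate is not None:
        return sample_rate
    # 2. Otherwise use the feature extractor's rate, if it exposes one
    #    (the attribute name `sampling_rate` is an assumption here).
    rate = getattr(feature_extractor, "sampling_rate", None)
    if rate is not None:
        return rate
    # 3. Last resort: the documented default.
    return DEFAULT_SAMPLE_RATE


# Stand-in object for illustration only, not a real HF feature extractor.
fake_extractor = SimpleNamespace(sampling_rate=8000)

explicit = resolve_sample_rate(22050, fake_extractor)
from_extractor = resolve_sample_rate(None, fake_extractor)
fallback = resolve_sample_rate(None, None)
```

Under these assumptions, an explicit value is returned unchanged, the stand-in extractor's rate is used when sample_rate is None, and 16000 is returned when neither is available.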
Example
>>> import torch
>>> from speechbrain.integrations.huggingface.w2v_bert import W2VBert
>>> inputs = torch.rand([2, 16000])
>>> model_hub = "facebook/w2v-bert-2.0"
>>> save_path = "savedir"
>>> model = W2VBert(model_hub, save_path)
>>> outputs = model(inputs)
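The interaction between the freeze and freeze_feature_extractor parameters described above can be summarized with a small sketch. The helper trainable_parts and the labels "feature_extractor" and "encoder" are hypothetical names chosen for illustration; the sketch also assumes that when freeze is True the second flag has no additional effect, which follows from the parameter descriptions but is not stated explicitly.

```python
def trainable_parts(freeze=True, freeze_feature_extractor=False):
    """Hypothetical helper: report which parts would receive gradient updates."""
    if freeze:
        # The whole model is frozen, so the feature extractor is frozen too.
        return {"feature_extractor": False, "encoder": False}
    return {
        # Only the convolutional feature extractor is held fixed when the flag is set.
        "feature_extractor": not freeze_feature_extractor,
        # The rest of the model trains alongside the pipeline.
        "encoder": True,
    }


default_setup = trainable_parts()
full_finetune = trainable_parts(freeze=False)
partial_finetune = trainable_parts(freeze=False, freeze_feature_extractor=True)
```

With the defaults nothing is trainable; with freeze=False everything is trainable unless freeze_feature_extractor=True, in which case only the feature extractor stays fixed.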
- forward(wav: Tensor, wav_lens: Tensor | None = None) → Tensor
Takes an input waveform and returns its corresponding w2v-BERT encoding.
- Parameters:
wav (torch.Tensor (signal)) – A batch of audio signals to transform to features.
wav_lens (torch.Tensor or None) – The relative length of the wav given in SpeechBrain format.
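As an illustration of the relative-length convention, the sketch below zero-pads a batch of variable-length signals and computes each signal's length as a fraction of the longest one. It uses plain Python lists to stay dependency-free; pad_and_relative_lengths is a hypothetical helper, and in practice both results would be converted to torch.Tensor before being passed to forward.

```python
def pad_and_relative_lengths(signals):
    """Zero-pad variable-length signals and return (batch, wav_lens)."""
    max_len = max(len(s) for s in signals)
    # Pad every signal on the right so all rows share the same length.
    batch = [list(s) + [0.0] * (max_len - len(s)) for s in signals]
    # Relative length: fraction of the longest signal in the batch, in (0, 1].
    wav_lens = [len(s) / max_len for s in signals]
    return batch, wav_lens


# Three dummy signals of 1.0 s, 0.5 s and 0.75 s at 16 kHz.
signals = [[0.1] * 16000, [0.2] * 8000, [0.3] * 12000]
batch, wav_lens = pad_and_relative_lengths(signals)
print(wav_lens)  # [1.0, 0.5, 0.75]
```

The longest signal always has relative length 1.0, so the values stay meaningful regardless of the absolute number of samples.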
- Returns:
w2v-BERT encoded features.
- Return type: