Neural Architectures

πŸ”— Fine-tuning or using Whisper, wav2vec2, HuBERT and others with SpeechBrain and HuggingFace

Parcollet T. & Moumen A.

Dec. 2022

Difficulty: medium

Time: 20min

πŸ”— Google Colab

This tutorial describes how to use and fine-tune pretrained models coming from HuggingFace. Any wav2vec 2.0 / HuBERT / WavLM or Whisper model integrated into the HuggingFace transformers interface can then be plugged into SpeechBrain to tackle a speech-related task: automatic speech recognition, speaker recognition, spoken language understanding …
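
As a taste of what the tutorial covers, here is a minimal sketch that loads a pretrained wav2vec 2.0 encoder directly through the transformers library and extracts frame-level features. SpeechBrain wraps encoders like this behind its own modules so they can be dropped into a recipe; the checkpoint name and shapes below are only illustrative.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint; any wav2vec 2.0 / HuBERT / WavLM model works the same way.
source = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(source)
encoder = Wav2Vec2Model.from_pretrained(source)

wav = torch.randn(16000)  # 1 second of dummy audio at 16 kHz
inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    feats = encoder(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(feats.shape)
```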

πŸ”— Neural Network Adapters for faster low-memory fine-tuning

Plantinga P.

Sep. 2024

Difficulty: easy

Time: 20min

πŸ”— Google Colab

This tutorial covers the SpeechBrain implementation of adapters such as LoRA, including how to integrate SpeechBrain's own adapters, custom adapters, or adapters from libraries such as PEFT into a pre-trained model.
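
To illustrate the idea behind adapters, here is a minimal, self-contained LoRA sketch in plain PyTorch (not SpeechBrain's actual adapter API): the pretrained weight is frozen and only a low-rank update is trained, which is what keeps memory usage low.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # B starts at zero so the adapted layer initially matches the base layer.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```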

πŸ”— Complex and Quaternion Neural Networks

Parcollet T.

Feb. 2021

Difficulty: medium

Time: 30min

πŸ”— Google Colab

This tutorial demonstrates how to use the SpeechBrain implementation of complex-valued and quaternion-valued neural networks for speech technologies. It covers the basics of these high-dimensional representations and the associated neural layers: Linear, Convolution, Recurrent and Normalisation.
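
To make the core operation concrete, here is a hedged sketch of a quaternion-valued linear layer in plain PyTorch (a simplified illustration of the maths, not SpeechBrain's implementation): features are split into four components and combined through the Hamilton product.

```python
import torch
import torch.nn as nn

class QuaternionLinear(nn.Module):
    """Minimal quaternion linear layer based on the Hamilton product.

    in_features and out_features count quaternion units, so the real
    tensor dimensions are four times larger.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        init = lambda: nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.Wr, self.Wi, self.Wj, self.Wk = init(), init(), init(), init()

    def forward(self, x):
        # Split the last dimension into the four quaternion components.
        r, i, j, k = x.chunk(4, dim=-1)
        lin = lambda W, v: v @ W.T
        # Hamilton product of the weight quaternion with the input quaternion.
        out_r = lin(self.Wr, r) - lin(self.Wi, i) - lin(self.Wj, j) - lin(self.Wk, k)
        out_i = lin(self.Wi, r) + lin(self.Wr, i) - lin(self.Wk, j) + lin(self.Wj, k)
        out_j = lin(self.Wj, r) + lin(self.Wk, i) + lin(self.Wr, j) - lin(self.Wi, k)
        out_k = lin(self.Wk, r) - lin(self.Wj, i) + lin(self.Wi, j) + lin(self.Wr, k)
        return torch.cat([out_r, out_i, out_j, out_k], dim=-1)

layer = QuaternionLinear(64, 32)  # 64 quaternion inputs -> 32 quaternion outputs
y = layer(torch.randn(8, 256))   # 256 = 64 units * 4 components
print(y.shape)                   # torch.Size([8, 128])
```

Weight sharing across the four components is what gives quaternion networks their parameter savings compared with a real-valued layer of the same width.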

πŸ”— Recurrent Neural Networks

Ravanelli M.

Feb. 2021

Difficulty: easy

Time: 30min

πŸ”— Google Colab

Recurrent Neural Networks (RNNs) offer a natural way to process sequences. This tutorial demonstrates how to use the SpeechBrain implementations of RNNs, including LSTM, GRU, vanilla RNN, and LiGRU, a recurrent cell designed specifically for speech-related tasks. RNNs are at the core of many sequence-to-sequence models.
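
To give a flavour of LiGRU, here is a simplified single-cell sketch in plain PyTorch: compared with a GRU, the reset gate is dropped and tanh is replaced by ReLU. The full design also batch-normalises the input projections, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class LiGRUCell(nn.Module):
    """Simplified Light GRU cell: no reset gate, ReLU activation."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.wz = nn.Linear(input_size, hidden_size)
        self.uz = nn.Linear(hidden_size, hidden_size, bias=False)
        self.wh = nn.Linear(input_size, hidden_size)
        self.uh = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x, h):
        z = torch.sigmoid(self.wz(x) + self.uz(h))   # update gate
        h_cand = torch.relu(self.wh(x) + self.uh(h)) # candidate state
        return z * h + (1.0 - z) * h_cand            # interpolate old and new state

cell = LiGRUCell(40, 128)
h = torch.zeros(8, 128)
for x_t in torch.randn(100, 8, 40):  # (time, batch, features)
    h = cell(x_t, h)
print(h.shape)  # torch.Size([8, 128])
```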

πŸ”— Streaming Speech Recognition with Conformers

de Langen S.

Sep. 2024

Difficulty: medium

Time: 60min+

πŸ”— Google Colab

Automatic Speech Recognition (ASR) models are often designed only to transcribe an entire chunk of audio at once, which makes them unsuitable for use cases like live-stream transcription that require low-latency, long-form transcription.

This tutorial introduces the Dynamic Chunk Training approach and the architectural changes you can apply to make the Conformer model streamable. It also covers the training and inference tooling that SpeechBrain provides. This is a good starting point if you want to train and understand your own streaming models, or even explore improved streaming architectures.
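
As a concrete illustration of the chunking idea behind Dynamic Chunk Training, here is a hedged sketch (not SpeechBrain's actual implementation) of a chunked self-attention mask: each frame may attend only to frames in its own chunk plus a limited number of past chunks, which is what bounds latency and enables streaming inference.

```python
import torch

def chunked_attention_mask(seq_len: int, chunk_size: int, left_chunks: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) restricting each query frame
    to its own chunk plus `left_chunks` chunks of left context."""
    chunk_id = torch.arange(seq_len) // chunk_size
    q = chunk_id.unsqueeze(1)  # query chunk index, shape (seq_len, 1)
    k = chunk_id.unsqueeze(0)  # key chunk index, shape (1, seq_len)
    return (k <= q) & (k >= q - left_chunks)

# 12 frames, chunks of 4, one chunk of left context:
print(chunked_attention_mask(12, 4, 1).int())
```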