speechbrain.lobes.models.huggingface_transformers.discrete_ssl module

This lobe enables the integration of pretrained discrete SSL models (HuBERT, WavLM, Wav2Vec2) for extracting semantic tokens from the output of SSL layers.

The HuggingFace transformers library needs to be installed: https://huggingface.co/transformers/installation.html

Authors
  • Pooneh Mousavi 2024

  • Jarod Duret 2024

Summary

Classes:

DiscreteSSL

This lobe enables the integration of HuggingFace and SpeechBrain pretrained Discrete SSL models.

Reference

class speechbrain.lobes.models.huggingface_transformers.discrete_ssl.DiscreteSSL(save_path, ssl_model, kmeans_dataset, vocoder_repo_id='speechbrain/hifigan-wavlm-k1000-LibriTTS', num_clusters=1000, layers_num=None, device='cpu', sample_rate=16000)[source]

Bases: Module

This lobe enables the integration of HuggingFace and SpeechBrain pretrained Discrete SSL models.

The HuggingFace transformers library needs to be installed: https://huggingface.co/transformers/installation.html

The model can be used as a fixed Discrete feature extractor or can be finetuned. It will download automatically the model from HuggingFace or use a local path.

The following table summarizes the compatible SSL models, their respective HF encoders, k-means training details, supported layers, and pretrained vocoder:

SSL Model | HF Encoder | K-Means Dataset | K-Means Size | SSL Layers | Vocoder Model
----------|------------|-----------------|--------------|------------|--------------
WavLM | microsoft/wavlm-large | LibriSpeech960 | 1000 | 1, 3, 7, 12, 18, 23 | speechbrain/hifigan-wavlm-k1000-LibriTTS
HuBERT | facebook/hubert-large-ll60k | LibriSpeech960 | 1000 | 1, 3, 7, 12, 18, 23 | speechbrain/hifigan-hubert-k1000-LibriTTS
Wav2Vec2 | facebook/wav2vec2-large | LibriSpeech960 | 1000 | 1, 3, 7, 12, 18, 23 | speechbrain/hifigan-wav2vec2-k1000-LibriTTS

Parameters:
  • save_path (str) – Path (dir) of the downloaded model.

  • ssl_model (str) – SSL model to extract semantic tokens from its layers’ output. Note that output_all_hiddens should be set to True to enable multi-layer discretization.

  • kmeans_dataset (str) – Name of the dataset on which the k-means model in the HF repo was trained.

  • vocoder_repo_id (str) – Huggingface repository that contains the pre-trained HiFi-GAN model.

  • num_clusters (int or List[int] (default: 1000)) – Number of clusters of the targeted k-means models to be downloaded. It may differ for each layer.

  • layers_num (List[int] (Optional)) – Layers to be downloaded from the HF repo. If not provided, all layers with num_clusters (int) clusters are loaded from the HF repo. If num_clusters is a list, layers_num must be provided to determine the cluster number for each layer.

  • device (str (default 'cpu')) – The device to use for computation ('cpu' or 'cuda').

  • sample_rate (int (default: 16000)) – Sample rate of the input audio.

Example

>>> import torch
>>> from speechbrain.lobes.models.huggingface_transformers.wavlm import (WavLM)
>>> inputs = torch.rand([3, 2000])
>>> model_hub = "microsoft/wavlm-large"
>>> save_path = "savedir"
>>> ssl_layer_num = [7, 23]
>>> deduplicate = [False, True]
>>> bpe_tokenizers = [None, None]
>>> vocoder_repo_id = "speechbrain/hifigan-wavlm-k1000-LibriTTS"
>>> kmeans_dataset = "LibriSpeech"
>>> num_clusters = 1000
>>> ssl_model = WavLM(model_hub, save_path, output_all_hiddens=True)
>>> model = DiscreteSSL(save_path, ssl_model, vocoder_repo_id=vocoder_repo_id, kmeans_dataset=kmeans_dataset, num_clusters=num_clusters)
>>> tokens, _, _ = model.encode(inputs, SSL_layers=ssl_layer_num, deduplicates=deduplicate, bpe_tokenizers=bpe_tokenizers)
>>> print(tokens.shape)
torch.Size([3, 6, 2])
>>> sig = model.decode(tokens, ssl_layer_num)
>>> print(sig.shape)
torch.Size([3, 1, 1920])
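
When num_clusters differs across layers, pass a list together with layers_num. A minimal sketch, assuming matching per-layer k-means checkpoints are available on the HF repo:

>>> model_multi = DiscreteSSL(
...     save_path,
...     ssl_model,
...     vocoder_repo_id=vocoder_repo_id,
...     kmeans_dataset=kmeans_dataset,
...     num_clusters=[1000, 1000],  # one cluster size per requested layer
...     layers_num=[7, 23],
... )
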
check_if_input_is_compatible(layers_num, num_clusters)[source]

Check whether layers_num and num_clusters are consistent with each other.

Parameters:
  • layers_num (List[int] (Optional)) – If num_clusters is a list, the layers_num should be provided to determine the cluster number for each layer.

  • num_clusters (int or List[int]) – Number of clusters of the targeted k-means models to be downloaded. It may differ for each layer.
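
The rule this check enforces, as a minimal sketch reusing the model from the class-level example (the commented-out call would be rejected, since a list of cluster sizes requires layers_num):

>>> _ = model.check_if_input_is_compatible(layers_num=[7, 23], num_clusters=[1000, 1000])
>>> # model.check_if_input_is_compatible(layers_num=None, num_clusters=[1000, 1000])  # inconsistent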

load_kmeans(repo_id, kmeans_dataset, encoder_name, num_clusters, cache_dir, layers_num=None)[source]

Load a pretrained k-means model from HF.

Parameters:
  • repo_id (str) – The HuggingFace repo id that contains the model.

  • kmeans_dataset (str) – Name of the dataset on which the k-means models in the HF repo were trained.

  • encoder_name (str) – Name of the encoder for locating files.

  • num_clusters (int or List[int]) – Number of clusters of the targeted k-means models to be downloaded. It may differ for each layer.

  • cache_dir (str) – Path (dir) of the downloaded model.

  • layers_num (List[int] (Optional)) – If num_clusters is a list, the layers_num should be provided to determine the cluster number for each layer.

Returns:

  • kmeans_model (MiniBatchKMeans) – Pretrained k-means model loaded from HF.

  • layer_ids (List[int]) – Supported layer numbers for k-means (extracted from the names of the k-means models).
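
A minimal sketch of a direct call; the repo id and encoder name below are assumptions for illustration, not values documented on this page:

>>> kmeans_model, layer_ids = model.load_kmeans(
...     repo_id="speechbrain/SSL_Quantization",  # assumed HF repo hosting the k-means checkpoints
...     kmeans_dataset="LibriSpeech",
...     encoder_name="wavlm",  # assumed encoder name used in the checkpoint filenames
...     num_clusters=1000,
...     cache_dir=save_path,
... )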

forward(wav, wav_lens=None, SSL_layers=None, deduplicates=None, bpe_tokenizers=None)[source]

Takes an input waveform and returns its corresponding tokens and reconstructed signal.

Parameters:
  • wav (torch.Tensor) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative lengths of the waveforms, given in SpeechBrain format.

  • SSL_layers (List[int]) – Determines which layers of the SSL model should be used to extract information.

  • deduplicates (List[bool]) – Determines whether to apply deduplication (removing duplicate consecutive tokens) to the tokens extracted for the corresponding layer.

  • bpe_tokenizers (List) – Determines whether to apply subwording to the tokens extracted for the corresponding layer, if a SentencePiece tokenizer has been trained for that layer.

Returns:

  • tokens (torch.Tensor) – A (Batch x Seq x num_SSL_layers) tensor of audio tokens

  • waveforms (torch.Tensor) – Batch of reconstructed waveforms [batch, time]
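
Calling the module invokes forward; a short sketch reusing the objects from the class-level example above:

>>> tokens, waveforms = model(inputs, SSL_layers=ssl_layer_num, deduplicates=deduplicate, bpe_tokenizers=bpe_tokenizers)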

encode(wav, wav_lens=None, SSL_layers=None, deduplicates=None, bpe_tokenizers=None)[source]

Takes an input waveform and returns its corresponding encoding.

Parameters:
  • wav (torch.Tensor) – A batch of audio signals to transform to features.

  • wav_lens (torch.Tensor) – The relative lengths of the waveforms, given in SpeechBrain format.

  • SSL_layers (List[int]) – Determines which layers of the SSL model should be used to extract information.

  • deduplicates (List[bool]) – Determines whether to apply deduplication (removing duplicate consecutive tokens) to the tokens extracted for the corresponding layer.

  • bpe_tokenizers (List) – Determines whether to apply subwording to the tokens extracted for the corresponding layer, if a SentencePiece tokenizer has been trained for that layer.

Returns:

  • tokens (torch.Tensor) – A (Batch x Seq x num_SSL_layers) tensor of audio tokens

  • emb (torch.Tensor) – A (Batch x Seq x num_SSL_layers x embedding_dim) tensor of cluster-center embeddings for each token.

  • processed_tokens (torch.Tensor) – A (Batch x Seq x num_SSL_layers) tensor of audio tokens after applying deduplication and subwording, where required.
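
The full return triple can be unpacked as below, extending the class-level example (shapes follow that example):

>>> tokens, emb, processed_tokens = model.encode(inputs, SSL_layers=ssl_layer_num, deduplicates=deduplicate, bpe_tokenizers=bpe_tokenizers)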

decode(tokens, SSL_layers=None)[source]

Takes discrete tokens as input and returns the corresponding reconstructed waveform. Original source: https://github.com/speechbrain/benchmarks/blob/c87beb61d4747909a133d3e1b3a3df7c8eda1f08/benchmarks/DASB/Libri2Mix/separation/conformer/train_discrete_ssl.py#L44

Parameters:
  • tokens (torch.Tensor) – A (Batch, codes, layers) tensor of discrete units

  • SSL_layers (List[int]) – Determines which layers of the SSL model should be used by the vocoder.

Returns:

waveforms – Batch of reconstructed waveforms [batch, time]

Return type:

torch.Tensor
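
To save the decoded signal to disk, a minimal sketch using torchaudio (the output filename is hypothetical):

>>> import torchaudio
>>> sig = model.decode(tokens, SSL_layers=ssl_layer_num)
>>> torchaudio.save("reconstructed.wav", sig[0], 16000)  # sig[0] has shape [1, time]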