speechbrain.integrations.huggingface.llama module

This lobe enables the integration of HuggingFace pretrained LLaMA models.

Authors
  • Titouan Parcollet 2025

  • Shucong Zhang 2025

  • Pooneh Mousavi 2023

  • Adel Moumen 2025

Summary

Classes:

LLaMA

This lobe enables the integration of HuggingFace pretrained LLaMA models.

Reference

class speechbrain.integrations.huggingface.llama.LLaMA(source: str, save_path: str, bnb_config: BitsAndBytesConfig = None, freeze: bool = False, pad_token: str = '[PAD]', torch_dtype: dtype = torch.float16, additional_special_tokens: List[str] = None, pad_to_multiple_of: int = 8, **kwargs)

Bases: HFTransformersInterface

This lobe enables the integration of HuggingFace pretrained LLaMA models.

The model can be fine-tuned entirely or coupled with SpeechBrain (and peft) adapters (see https://speechbrain.readthedocs.io/en/latest/tutorials/nn/neural-network-adapters.html).

Quantisation can be applied by passing a BitsAndBytesConfig, which can be instantiated in a SpeechBrain YAML file (or elsewhere).
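As a hedged illustration of the quantisation path, the sketch below assembles constructor arguments for LLaMA and attaches a BitsAndBytesConfig only when quantisation is requested. The helper name build_llama_kwargs is hypothetical. The keys source, save_path and bnb_config come from the class signature, while the 4-bit NF4 settings are an assumption chosen for illustration. The transformers import is deferred so the helper still runs when quantisation is off.

```python
from typing import Any, Dict


def build_llama_kwargs(
    source: str, save_path: str, quantise: bool = False
) -> Dict[str, Any]:
    """Assemble keyword arguments for the LLaMA lobe (hypothetical helper)."""
    kwargs: Dict[str, Any] = {"source": source, "save_path": save_path}
    if quantise:
        # Deferred import: transformers (and bitsandbytes) are only needed
        # when quantisation is actually requested.
        from transformers import BitsAndBytesConfig

        # Assumption: 4-bit NF4 quantisation; pick the settings that your
        # hardware and task call for.
        kwargs["bnb_config"] = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_quant_type="nf4"
        )
    return kwargs


# Without quantisation, only the two required arguments are set.
kwargs = build_llama_kwargs("meta-llama/Llama-2-7b-chat-hf", "savedir")
```

With the required packages installed, the lobe would then be created with LLaMA(**kwargs), passing quantise=True to the helper to include the bnb_config entry.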

The HuggingFace Transformers library needs to be installed: https://huggingface.co/transformers/installation.html

Parameters:
  • source (str) – HuggingFace hub name, e.g. "meta-llama/Llama-2-7b-chat-hf".

  • save_path (str) – Path (dir) of the downloaded model.

  • bnb_config (transformers.BitsAndBytesConfig) – BitsAndBytesConfig enabling quantisation of the model. If not specified, the model weights are loaded with the dtype given by torch_dtype.

  • freeze (bool (default: False)) – If True, the model is frozen. If False, the model is trained alongside the rest of the pipeline.

  • pad_token (str (default: "[PAD]")) – String representation of the padding token. This may change from one model to another.

  • torch_dtype (torch.dtype (default: torch.float16)) – If no bnb_config is given, this parameter defines the dtype in which the model parameters are loaded. This is useful to reduce the memory footprint, but it does not change the compute dtype; for that, refer to mixed precision training in SpeechBrain.

  • additional_special_tokens (List[str], optional) – A list of additional special tokens to add to the tokenizer. These tokens will be added using the tokenizer’s add_special_tokens method.

  • pad_to_multiple_of (int (default: 8)) – The token embeddings will be resized to a multiple of this value. This is useful to maximise the use of tensor cores on modern GPUs.

  • **kwargs (dict) – Extra keyword arguments passed to the from_pretrained function. This can be used, for instance, to change the type of attention. The HuggingFace documentation lists the full set of parameters, which may be model dependent.
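To make the effect of pad_to_multiple_of concrete, the following minimal sketch computes the number of embedding rows after resizing, assuming the size is rounded up to the next multiple. The function name padded_embedding_rows and the vocabulary figures are illustrative assumptions, not part of the SpeechBrain API.

```python
def padded_embedding_rows(vocab_size: int, pad_to_multiple_of: int = 8) -> int:
    """Round vocab_size up to the nearest multiple of pad_to_multiple_of."""
    if pad_to_multiple_of <= 0:
        raise ValueError("pad_to_multiple_of must be positive")
    # Ceiling division, then scale back up to a row count.
    return -(-vocab_size // pad_to_multiple_of) * pad_to_multiple_of


# Assumed figures: a 32000-token vocabulary plus one added "[PAD]" token.
rows = padded_embedding_rows(32000 + 1, 8)
print(rows)  # 32008
```

The extra rows are unused by the tokenizer; they only keep the embedding matrix dimensions aligned.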

Example

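>>> import torch
>>> from speechbrain.integrations.huggingface.llama import LLaMA
>>> model_hub = "meta-llama/Llama-2-7b-chat-hf"
>>> save_path = "savedir"
>>> model = LLaMA(model_hub, save_path)
>>> tokens = torch.tensor([[1, 1]])
>>> attention_mask = torch.tensor([[1, 1]])
>>> # forward() is documented as taking keyword arguments only
>>> outputs = model(input_ids=tokens, attention_mask=attention_mask)

override_config(config)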

Users should modify this function according to their own tasks.

Parameters:

config (HuggingFace config object) – The original config.

Returns:

config – Overridden config.

Return type:

HuggingFace config object

forward(**kwargs)

This function wraps the HuggingFace forward function. See the HuggingFace documentation of your Llama model of interest to know which parameters to pass, typically the input tokens or embeddings and attention masks.

Parameters:

**kwargs (dict) – Keyword arguments forwarded to the HuggingFace model. Refer to the HuggingFace documentation of your Llama model of interest.

Returns:

output – This depends on the Llama model. Please refer to the HuggingFace documentation.

Return type:

torch.Tensor
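Since forward expects input tokens and attention masks as keyword arguments, a batch of variable-length sequences has to be padded first. The sketch below shows one way to do this with plain Python lists. The helper name pad_batch and the pad_id value of 0 are assumptions for illustration; in practice the padding index comes from the tokenizer.

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]], pad_id: int, side: str = "right"
) -> Dict[str, List[List[int]]]:
    """Pad token-id lists to equal length and build the matching mask."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)  # 0 marks padded positions
        real = [1] * len(seq)  # 1 marks real tokens
        if side == "right":
            input_ids.append(seq + padding)
            attention_mask.append(real + mask_pad)
        else:
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + real)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# pad_id=0 is a placeholder value, not the real "[PAD]" index.
batch = pad_batch([[1, 15, 27], [1, 42]], pad_id=0)
```

After converting both lists with torch.tensor, they can be passed by keyword, for example model(input_ids=..., attention_mask=...). Left padding (side="left") is usually preferred when the same batch is later given to generate.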

generate(**kwargs)

This function wraps the HuggingFace generate function. See the HuggingFace documentation of your Llama model of interest to know which parameters to pass, typically the input tokens or embeddings, attention masks and a transformers.GenerationConfig.

Parameters:

**kwargs (dict) – Keyword arguments forwarded to the HuggingFace generate function. Refer to the HuggingFace documentation of your Llama model of interest.

Returns:

hyp – The generated outputs as token indices.

Return type:

torch.Tensor
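HuggingFace decoder-only models typically return the prompt tokens followed by the newly generated ones when generate is called with input tokens. Assuming that behaviour holds for your model, and that every prompt in the batch was padded to the same length, the new tokens can be isolated as sketched below. The helper name strip_prompt and the token indices are made up for illustration.

```python
from typing import List


def strip_prompt(hyp: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first prompt_len token indices from every hypothesis."""
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    return [row[prompt_len:] for row in hyp]


# Made-up indices: a 3-token prompt followed by generated tokens.
hyp = [[1, 15, 27, 301, 302, 2], [1, 42, 7, 511, 2, 0]]
new_tokens = strip_prompt(hyp, prompt_len=3)
```

On a torch.Tensor the same operation is the slice hyp[:, prompt_len:].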