speechbrain.integrations.huggingface.llama module
This lobe enables the integration of HuggingFace pretrained LLaMA models.
- Authors
Titouan Parcollet 2025
Shucong Zhang 2025
Pooneh Mousavi 2023
Adel Moumen 2025
Summary
Classes:
LLaMA: This lobe enables the integration of HuggingFace pretrained LLaMA models.
Reference
- class speechbrain.integrations.huggingface.llama.LLaMA(source: str, save_path: str, bnb_config: BitsAndBytesConfig = None, freeze: bool = False, pad_token: str = '[PAD]', torch_dtype: dtype = torch.float16, additional_special_tokens: List[str] = None, pad_to_multiple_of: int = 8, **kwargs)
Bases: HFTransformersInterface
This lobe enables the integration of HuggingFace pretrained LLaMA models.
The model can be fine-tuned entirely or coupled with SpeechBrain (and peft) adapters (see https://speechbrain.readthedocs.io/en/latest/tutorials/nn/neural-network-adapters.html).
Quantisation can be applied by passing a BitsAndBytesConfig, which can be instantiated in a SpeechBrain yaml (or elsewhere).
The HuggingFace Transformers library needs to be installed: https://huggingface.co/transformers/installation.html
- Parameters:
source (str) – HuggingFace hub name, e.g. "meta-llama/Llama-2-7b-chat-hf".
save_path (str) – Path (dir) of the downloaded model.
bnb_config (transformers.BitsAndBytesConfig) – BitsAndBytesConfig enabling quantisation of the model. If not specified, the model weights are loaded with the dtype given by torch_dtype.
freeze (bool (default: False)) – If True, the model is frozen. If False, the model is trained alongside the rest of the pipeline.
pad_token (str (default: "[PAD]")) – String representation of the padding token. This may change from one model to another.
torch_dtype (torch.dtype (default: torch.float16)) – If no bnb_config is given, this parameter defines the dtype in which the model parameters are loaded. This is useful to reduce the memory footprint, but it does not change the compute dtype. To change the compute dtype, refer to mixed precision training in SpeechBrain.
additional_special_tokens (List[str], optional) – A list of additional special tokens to add to the tokenizer. These tokens are added using the tokenizer's add_special_tokens method.
pad_to_multiple_of (int (default: 8)) – The token embeddings are resized to a multiple of this value. This is useful to maximise the use of tensor cores on modern GPUs.
**kwargs (dict) – Extra keyword arguments passed to the from_pretrained function. This can be used, for instance, to change the type of attention. The HuggingFace documentation gives the full set of parameters, which may be model dependent.
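The effect of pad_to_multiple_of and torch_dtype can be estimated with simple arithmetic. The sketch below is only an illustration: the helper names, the 32000-token vocabulary and the 7-billion parameter count are assumptions chosen for the example, and the memory estimate covers the weights alone (no activations, gradients or optimizer state).

```python
def round_up_to_multiple(n: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Hypothetical vocabulary: 32000 base tokens plus one added "[PAD]" token.
vocab_size = 32000 + 1
padded_vocab_size = round_up_to_multiple(vocab_size, 8)
print(padded_vocab_size)  # 32008

# Hypothetical 7-billion-parameter model.
# float32 stores 4 bytes per weight, float16 stores 2.
num_params = 7_000_000_000
fp32_gib = weight_memory_gib(num_params, 4)
fp16_gib = weight_memory_gib(num_params, 2)
print(f"{fp32_gib:.1f} GiB vs {fp16_gib:.1f} GiB")  # 26.1 GiB vs 13.0 GiB
```

Under these assumptions, adding a single pad token grows the embedding table by eight rows rather than one, and loading in float16 halves the weight memory compared with float32.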
Example
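>>> import torch
>>> from speechbrain.integrations.huggingface.llama import LLaMA
>>> model_hub = "meta-llama/Llama-2-7b-chat-hf"
>>> save_path = "savedir"
>>> model = LLaMA(model_hub, save_path)
>>> tokens = torch.tensor([[1, 1]])
>>> attention_mask = torch.tensor([[1, 1]])
>>> # forward(**kwargs) takes keyword arguments, which are passed on to the HuggingFace model
>>> outputs = model(input_ids=tokens, attention_mask=attention_mask)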
- override_config(config)
Users should modify this function according to their own tasks.
- Parameters:
config (HuggingFace config object) – The original config.
- Returns:
config – Overridden config.
- Return type:
HuggingFace config object
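As a rough sketch of what an override might look like, the function below changes one attribute and returns the config. It is written as a plain function operating on a stand-in object so that it runs without downloading a model; in practice it would presumably be a method (taking self) on a subclass of LLaMA. The choice of output_hidden_states is only an example of an attribute one might change.

```python
from types import SimpleNamespace


def override_config(config):
    """Return the config after applying task-specific changes."""
    # `output_hidden_states` is used here only as an example attribute.
    config.output_hidden_states = True
    return config


# Stand-in for a HuggingFace config object (an assumption for this sketch).
original = SimpleNamespace(output_hidden_states=False)
overridden = override_config(original)
print(overridden.output_hidden_states)  # True
```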
- forward(**kwargs)
This function wraps the HuggingFace forward function. See the HuggingFace documentation of your Llama model of interest to know which parameters to pass, typically the input tokens or embeddings and attention masks.
- Parameters:
**kwargs (dict) – Please refer to HuggingFace documentation and map it to your Llama model of interest.
- Returns:
output – This depends on the Llama model. Please refer to the HuggingFace documentation.
- Return type:
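The input tokens and attention masks mentioned above usually come as a pair: the mask is 1 for real tokens and 0 for padding. The sketch below builds such a pair from variable-length sequences using plain Python lists. The helper name build_batch and the pad id 0 are hypothetical, and the key names input_ids and attention_mask assume the usual HuggingFace convention for causal language models.

```python
from typing import Dict, List


def build_batch(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token-id sequences to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = build_batch([[1, 15, 27], [1, 42]], pad_id=0)
print(batch["input_ids"])       # [[1, 15, 27], [1, 42, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 1, 0]]
```

With a real model, each list would first be converted to a torch.long tensor, and the resulting dict could then be passed as keyword arguments, e.g. model(**batch).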
- generate(**kwargs)
This function wraps the HuggingFace generate function. See the HuggingFace documentation of your Llama model of interest to know which parameters to pass, typically the input tokens or embeddings, attention masks and a transformers.GenerationConfig.
- Parameters:
**kwargs (dict) – Please refer to HuggingFace documentation and map it to your Llama model of interest.
- Returns:
hyp – Contains tokenized (indices) outputs.
- Return type:
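For decoder-only models, HuggingFace generation typically returns the prompt tokens followed by the newly generated ones. Assuming hyp follows that convention and all prompts in the batch share the same (padded) length, the new tokens can be recovered by slicing off the prompt. The sketch below uses plain lists as a stand-in for the returned tensor; strip_prompt and the token ids are hypothetical.

```python
from typing import List


def strip_prompt(hyp: List[List[int]], prompt_len: int) -> List[List[int]]:
    """Drop the first `prompt_len` indices of every hypothesis."""
    return [row[prompt_len:] for row in hyp]


# Made-up token ids: a 3-token prompt followed by 3 generated tokens.
prompt = [[1, 15, 27]]
hyp = [[1, 15, 27, 88, 93, 2]]
new_tokens = strip_prompt(hyp, prompt_len=len(prompt[0]))
print(new_tokens)  # [[88, 93, 2]]
```

With a tensor output, the equivalent slice would be hyp[:, prompt_len:]. The remaining indices can then be turned back into text with the tokenizer.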