speechbrain.decoders.language_model module
Language model wrapper for kenlm n-gram.
This file is based on the implementation of the kenLM wrapper from PyCTCDecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used in CTC decoders.
See: speechbrain.decoders.ctc.py
- Authors
Adel Moumen 2023
Summary
Classes:
KenlmState | Wrapper for kenlm state.
LanguageModel | Language model container class to consolidate functionality.
Functions:
load_unigram_set_from_arpa | Read unigrams from arpa file.
Reference
- speechbrain.decoders.language_model.load_unigram_set_from_arpa(arpa_path: str) → Set[str] [source]
Read unigrams from arpa file.
Taken from: https://github.com/kensho-technologies/pyctcdecode
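The actual implementation is the one vendored from pyctcdecode; as an illustration only, a minimal pure-Python sketch of reading the `\1-grams:` section of an ARPA file could look like this (the function name and parsing details below are assumptions for the sketch, not this module's exact source):

```python
from typing import Set


def load_unigram_set_from_arpa_sketch(arpa_path: str) -> Set[str]:
    """Illustrative sketch: collect the words listed in the
    \\1-grams: section of an ARPA-format language model file."""
    unigrams = set()
    in_unigram_section = False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigram_section = True
                continue
            # Any later section header (e.g. \2-grams: or \end\) closes the block.
            if in_unigram_section and line.startswith("\\"):
                break
            if in_unigram_section and line:
                # ARPA unigram lines: log10_prob <tab> word [<tab> backoff]
                parts = line.split("\t")
                if len(parts) >= 2:
                    unigrams.add(parts[1])
    return unigrams
```

The returned set is what the CTC decoder uses to restrict hotword/vocabulary handling; see the `unigrams` parameter of `LanguageModel` below.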
- class speechbrain.decoders.language_model.KenlmState(state: State)[source]
Bases:
object
Wrapper for kenlm state.
This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.
Taken from: https://github.com/kensho-technologies/pyctcdecode
- Parameters:
state (kenlm.State) – Kenlm state object.
- property state: State
Get the raw state object.
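The wrapper is defensive: the raw kenlm state is exposed only through a read-only property, so decoder code cannot rebind it from outside the class. A minimal sketch of the same pattern, with a placeholder object standing in for a real kenlm.State (class name below is illustrative):

```python
class KenlmStateSketch:
    """Illustrative read-only wrapper, mirroring the pattern above."""

    def __init__(self, state):
        # The raw kenlm.State (any object works for this sketch).
        self._state = state

    @property
    def state(self):
        """Get the raw state object. No setter is defined,
        so assigning to .state raises AttributeError."""
        return self._state
```

Because only a getter is defined, `wrapper.state = other` fails, which is the protection the docstring describes.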
- class speechbrain.decoders.language_model.LanguageModel(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)[source]
Bases:
object
Language model container class to consolidate functionality.
This class is a wrapper around the kenlm language model. It provides functionality to score tokens and to get the initial state.
Taken from: https://github.com/kensho-technologies/pyctcdecode
- Parameters:
kenlm_model (kenlm.Model) – Kenlm model.
unigrams (list) – List of known word unigrams.
alpha (float) – Weight for language model during shallow fusion.
beta (float) – Weight for length score adjustment during scoring.
unk_score_offset (float) – Amount of log score offset for unknown tokens.
score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.
- get_start_state() → KenlmState [source]
Get initial lm state.
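The roles of `alpha` and `beta` follow the standard shallow-fusion recipe: `alpha` scales the language-model log-probability and `beta` adds a constant per-word bonus that counteracts the LM's bias toward short hypotheses. A schematic of how the two parameters combine during beam scoring (this is the generic formula, not code lifted from this module):

```python
def fused_score(acoustic_logp: float, lm_logp: float,
                alpha: float = 0.5, beta: float = 1.5) -> float:
    """Schematic shallow-fusion combination: the acoustic log-score
    plus the alpha-weighted LM log-score plus a per-word bonus beta."""
    return acoustic_logp + alpha * lm_logp + beta
```

Raising `alpha` trusts the n-gram model more; raising `beta` favors longer word sequences. The defaults above match the constructor defaults (`alpha=0.5`, `beta=1.5`).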