speechbrain.integrations.decoders.kenlm_scorer module
Language model wrapper for kenlm n-gram.
This file is based on the implementation of the kenLM wrapper from PyCTCDecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used in CTC decoders.
See: speechbrain.decoders.ctc
- Authors
Adel Moumen 2023
Peter Plantinga 2024
Summary
Classes:
KenlmScorer | KenLM language model container class to consolidate functionality.
KenlmState | Wrapper for kenlm state.
Functions:
LanguageModel | This function redirects users to the correct class name, printing a deprecation notice.
load_unigram_set_from_arpa | Read unigrams from arpa file.
Reference
- speechbrain.integrations.decoders.kenlm_scorer.LanguageModel(*args, **kwargs)
This function redirects users to the correct class name, printing a deprecation notice.
This can be removed once deprecation is complete.
- speechbrain.integrations.decoders.kenlm_scorer.load_unigram_set_from_arpa(arpa_path: str) -> Set[str]
Read unigrams from arpa file.
Taken from: https://github.com/kensho-technologies/pyctcdecode
- Parameters:
arpa_path (str) – Path to arpa file.
- Returns:
unigrams – Set of unigrams.
- Return type:
Set[str]
Example
>>> arpa_file = getfixture("tmpdir").join("bigram.arpa")
>>> arpa_file.write(
...     "Anything can be here\n"
...     + "\n"
...     + "\\data\\\n"
...     + "ngram 1=3\n"
...     + "ngram 2=4\n"
...     + "\n"
...     + "\\1-grams:\n"
...     + "0 <s>\n"
...     + "-0.6931 a 0.\n"
...     + "-0.6931 b 0.\n"
...     + ""  # Ends unigram section
...     + "\\2-grams:\n"
...     + "-0.6931 <s> a\n"
...     + "-0.6931 a a\n"
...     + "-0.6931 a b\n"
...     + "-0.6931 b a\n"
...     + "\n"  # Ends bigram section
...     + "\\end\\\n"
... )  # Ends whole file
>>> sorted(load_unigram_set_from_arpa(arpa_file))
['a', 'b']
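To illustrate the idea of reading unigrams out of an ARPA file, here is a minimal sketch that does not depend on SpeechBrain. The helper name `unigrams_from_arpa_lines` is hypothetical, and the rule of keeping only entries with three whitespace-separated fields (log-probability, token, backoff) is an assumption inferred from the example output above, where `<s>` is absent; the actual function may parse the file differently.

```python
from typing import Iterable, Set


def unigrams_from_arpa_lines(lines: Iterable[str]) -> Set[str]:
    """Collect tokens from the ``\\1-grams:`` section of ARPA-formatted lines.

    Hypothetical helper: only entries with exactly three fields
    (log-prob, token, backoff) are kept, which is an assumption.
    """
    unigrams: Set[str] = set()
    in_unigram_section = False
    for raw_line in lines:
        line = raw_line.strip()
        if line == "\\1-grams:":
            in_unigram_section = True
            continue
        # Any later section header (e.g. "\2-grams:" or "\end\") ends the scan.
        if in_unigram_section and line.startswith("\\"):
            break
        if in_unigram_section and line:
            fields = line.split()
            if len(fields) == 3:
                unigrams.add(fields[1])
    return unigrams


arpa_text = (
    "\\data\\\n"
    "ngram 1=3\n"
    "ngram 2=1\n"
    "\n"
    "\\1-grams:\n"
    "0 <s>\n"
    "-0.6931 a 0.\n"
    "-0.6931 b 0.\n"
    "\n"
    "\\2-grams:\n"
    "-0.6931 a b\n"
    "\n"
    "\\end\\\n"
)
found = sorted(unigrams_from_arpa_lines(arpa_text.splitlines()))
print(found)
```

Working on an iterable of lines rather than a path keeps the sketch easy to try without creating a file.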
- class speechbrain.integrations.decoders.kenlm_scorer.KenlmState(state: State)
Bases: object
Wrapper for kenlm state.
This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.
Taken from: https://github.com/kensho-technologies/pyctcdecode
- Parameters:
state (kenlm.State) – Kenlm state object.
- property state: State
Get the raw state object.
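The read-only access described here can be sketched with a plain Python property that has no setter. The class `ReadOnlyStateWrapper` is a hypothetical stand-in, not the actual KenlmState code, and a dictionary stands in for a `kenlm.State` object.

```python
class ReadOnlyStateWrapper:
    """Hypothetical stand-in for a state wrapper; holds any state object."""

    def __init__(self, state):
        self._state = state

    @property
    def state(self):
        """Return the wrapped object; no setter is defined."""
        return self._state


wrapped = ReadOnlyStateWrapper(state={"context": "<s>"})

try:
    wrapped.state = {}  # Rebinding fails because the property has no setter.
    reassigned = True
except AttributeError:
    reassigned = False

print(reassigned)
```

A property without a setter only prevents rebinding the attribute; it does not stop code from mutating the wrapped object itself.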
- class speechbrain.integrations.decoders.kenlm_scorer.KenlmScorer(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)
Bases: object
KenLM language model container class to consolidate functionality.
This class is a wrapper around the KenLM language model. It provides functionality to score tokens and to get the initial state.
Taken from: https://github.com/kensho-technologies/pyctcdecode
- Parameters:
kenlm_model (kenlm.Model) – Kenlm model.
unigrams (list) – List of known word unigrams.
alpha (float) – Weight for language model during shallow fusion.
beta (float) – Weight for the length score adjustment during scoring.
unk_score_offset (float) – Amount of log score offset for unknown tokens.
score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.
Example
>>> arpa_file = getfixture("tmpdir").join("bigram_hello.arpa")
>>> arpa_file.write(
...     "\\data\\\n"
...     + "ngram 1=4\n"
...     + "ngram 2=1\n\n"
...     + "\\1-grams:\n"
...     + "-1.0\t<s>\t-1.0\n"
...     + "-1.0\t</s>\t-1.0\n"
...     + "-1.0\tHello\t-0.23\n"
...     + "-0.7\tworld\t-0.25\n\n"
...     + "\\2-grams:\n"
...     + "-0.3\tHello world\n\n"
...     + "\\end\\"
... )
>>> model = kenlm.Model(str(arpa_file))
>>> scorer = KenlmScorer(kenlm_model=model, unigrams=["Hello", "world"])
>>> state = scorer.get_start_state()
>>> score, new_state = scorer.score(state, "Hello")
>>> round(score, 3)
-0.803
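The sketch below shows one way alpha, beta, and unk_score_offset could combine into the value returned in the example. It assumes the PyCTCDecode-style formula (convert the log10 score to natural log, scale by alpha, add beta), which this page does not spell out. The function name `combine_lm_score` is hypothetical, and the raw score of -2.0 for "Hello" is an assumption derived from the ARPA values above (backoff of `<s>`, -1.0, plus the unigram log-probability of "Hello", -1.0).

```python
import math

# Factor that converts a log10 score (as stored in ARPA files) to natural log.
LOG_BASE_CHANGE_FACTOR = 1.0 / math.log10(math.e)


def combine_lm_score(
    raw_log10_score: float,
    is_unknown: bool,
    alpha: float = 0.5,
    beta: float = 1.5,
    unk_score_offset: float = -10.0,
) -> float:
    """Hypothetical helper: weight a raw n-gram score for shallow fusion.

    Assumed formula, not confirmed by this page:
    alpha * (raw + offset if unknown) * ln(10) + beta
    """
    if is_unknown:
        raw_log10_score += unk_score_offset
    return alpha * raw_log10_score * LOG_BASE_CHANGE_FACTOR + beta


# Assumed raw score for "Hello" after "<s>": -1.0 (backoff) + -1.0 (unigram).
hello_score = combine_lm_score(-2.0, is_unknown=False)
unknown_score = combine_lm_score(-2.0, is_unknown=True)
print(round(hello_score, 3))
```

Under these assumptions the rounded result agrees with the -0.803 shown in the example, and a token flagged as unknown receives a strictly lower score.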
- get_start_state() -> KenlmState
Get initial lm state.