speechbrain.integrations.decoders.kenlm_scorer module

Language model wrapper for KenLM n-gram models.

This file is based on the KenLM wrapper from PyCTCDecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used in CTC decoders.

See: speechbrain.decoders.ctc

Authors

Adel Moumen 2023
Peter Plantinga 2024

Summary

Classes:

`KenlmScorer`	KenLM language model container class to consolidate functionality.
`KenlmState`	Wrapper for kenlm state.

Functions:

`LanguageModel`	This function redirects users to the correct class name, printing a deprecation notice.
`load_unigram_set_from_arpa`	Read unigrams from arpa file.

Reference

speechbrain.integrations.decoders.kenlm_scorer.LanguageModel(*args, **kwargs)[source]

This function redirects users to the correct class name, printing a deprecation notice.

This can be removed once deprecation is complete.
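A redirect of this kind is commonly written as a thin function that emits a warning and then constructs the new class. The following is only a sketch of that pattern: `NewScorer` and `OldScorer` are hypothetical names, and the use of `warnings.warn` is an assumption about the mechanism, not a description of how SpeechBrain emits its notice.

```python
import warnings


class NewScorer:
    """Hypothetical stand-in for the class that replaces the old name."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha


def OldScorer(*args, **kwargs):
    """Hypothetical deprecated alias: warn, then build the new class."""
    warnings.warn(
        "OldScorer is deprecated, use NewScorer instead.",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not at this function
    )
    return NewScorer(*args, **kwargs)
```

Because the alias forwards `*args` and `**kwargs` unchanged, existing call sites keep working until the old name is removed.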

speechbrain.integrations.decoders.kenlm_scorer.load_unigram_set_from_arpa(arpa_path: str) → Set[str][source]

Read unigrams from arpa file.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters: arpa_path (str) – Path to arpa file.
Returns: unigrams – Set of unigrams.
Return type: set

Example

>>> arpa_file = getfixture("tmpdir").join("bigram.arpa")
>>> arpa_file.write(
...     "Anything can be here\n"
...     + "\n"
...     + "\\data\\\n"
...     + "ngram 1=3\n"
...     + "ngram 2=4\n"
...     + "\n"
...     + "\\1-grams:\n"
...     + "0 <s>\n"
...     + "-0.6931 a 0.\n"
...     + "-0.6931 b 0.\n"
...     + "\n"  # Ends unigram section
...     + "\\2-grams:\n"
...     + "-0.6931 <s> a\n"
...     + "-0.6931 a a\n"
...     + "-0.6931 a b\n"
...     + "-0.6931 b a\n"
...     + "\n"  # Ends bigram section
...     + "\\end\\\n"
... )  # Ends whole file
>>> sorted(load_unigram_set_from_arpa(arpa_file))
['a', 'b']
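To illustrate the kind of parsing involved, here is a minimal sketch that scans the unigram section of an ARPA file. `read_unigrams` is a hypothetical helper, not the library function. The example output above omits `<s>`; the sketch reproduces that by keeping only entries that have all three fields (log-probability, word, backoff weight), which is an assumption about the rule rather than something this page states.

```python
from typing import Set


def read_unigrams(arpa_path: str) -> Set[str]:
    """Hypothetical helper: collect words from the 1-grams section."""
    unigrams = set()
    in_unigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            if in_unigrams and line.startswith("\\"):
                break  # next section header, e.g. \2-grams: or \end\
            if in_unigrams and line:
                fields = line.split()
                # Assumed rule: keep entries with probability, word, backoff.
                if len(fields) == 3:
                    unigrams.add(fields[1])
    return unigrams
```

Stopping at the next section header means the higher-order n-gram sections are never read, which keeps the scan cheap for large models.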

class speechbrain.integrations.decoders.kenlm_scorer.KenlmState(state: State)[source]

Bases: object

Wrapper for kenlm state.

This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters: state (kenlm.State) – KenLM state object.

property state: State: Get the raw state object.
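The read-only wrapper idea can be sketched with a plain Python property. `StateWrapper` below is a hypothetical class written for illustration, and a bare `object()` stands in for a `kenlm.State` instance so the snippet runs without kenlm installed.

```python
class StateWrapper:
    """Hypothetical read-only holder, mirroring the wrapper idea above."""

    def __init__(self, state):
        self._state = state  # kept private by convention

    @property
    def state(self):
        """Return the wrapped object; no setter is defined."""
        return self._state


raw = object()  # stand-in for a kenlm.State instance
wrapped = StateWrapper(raw)
```

A property without a setter prevents rebinding `wrapped.state`, but it does not stop code from mutating the underlying object once it has been retrieved.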

class speechbrain.integrations.decoders.kenlm_scorer.KenlmScorer(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)[source]

Bases: object

KenLM language model container class to consolidate functionality.

This class is a wrapper around the KenLM language model. It provides functionality to score tokens and to get the initial state.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

kenlm_model (kenlm.Model) – Kenlm model.
unigrams (list, optional) – Collection of known word unigrams. Defaults to None.
alpha (float) – Weight for language model during shallow fusion.
beta (float) – Weight for length score adjustment during scoring.
unk_score_offset (float) – Amount of log score offset for unknown tokens.
score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.

Example

>>> arpa_file = getfixture("tmpdir").join("bigram_hello.arpa")
>>> arpa_file.write(
...     "\\data\\\n"
...     + "ngram 1=4\n"
...     + "ngram 2=1\n\n"
...     + "\\1-grams:\n"
...     + "-1.0\t<s>\t-1.0\n"
...     + "-1.0\t</s>\t-1.0\n"
...     + "-1.0\tHello\t-0.23\n"
...     + "-0.7\tworld\t-0.25\n\n"
...     + "\\2-grams:\n"
...     + "-0.3\tHello world\n\n"
...     + "\\end\\"
... )
>>> import kenlm
>>> model = kenlm.Model(str(arpa_file))
>>> scorer = KenlmScorer(kenlm_model=model, unigrams=["Hello", "world"])
>>> state = scorer.get_start_state()
>>> score, new_state = scorer.score(state, "Hello")
>>> round(score, 3)
-0.803
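The value -0.803 can be reproduced by hand from the toy ARPA file. The sketch below assumes the scorer combines terms as in PyCTCDecode, that is, `alpha` times the natural-log LM score plus `beta`; that formula is an assumption here, since this page does not state it. The variable names are illustrative only.

```python
import math

alpha, beta = 0.5, 1.5   # KenlmScorer defaults from the signature
backoff_start = -1.0     # backoff weight of <s> in the toy ARPA file
logp_hello = -1.0        # unigram log10 probability of "Hello"

# No "<s> Hello" bigram exists, so the model backs off to the unigram.
log10_score = backoff_start + logp_hello

# Assumed combination: convert log10 to natural log, weight it, add beta.
score = alpha * log10_score * math.log(10) + beta
print(round(score, 3))  # -0.803
```

Since "Hello" is in the supplied unigram list, no `unk_score_offset` penalty enters this calculation.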

property order: int: Get the order of the n-gram language model.

get_start_state() → KenlmState[source]: Get initial lm state.

score_partial_token(partial_token: str) → float[source]: Get partial token score.

score(prev_state, word: str, is_last_word: bool = False) → Tuple[float, KenlmState][source]: Score word conditional on start state.
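Since `score` returns both a score and a new state, scoring a sentence amounts to threading the state from word to word. The sketch below uses only the two method signatures listed above; `score_sentence` is a hypothetical helper, and `ToyScorer` is a made-up stand-in with invented scores so the snippet runs without kenlm.

```python
from typing import List


def score_sentence(scorer, words: List[str]) -> float:
    """Hypothetical helper: sum per-word scores, threading the state."""
    state = scorer.get_start_state()
    total = 0.0
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        word_score, state = scorer.score(state, word, is_last_word=is_last)
        total += word_score
    return total


class ToyScorer:
    """Stand-in exposing the same two methods, with invented scores."""

    def get_start_state(self):
        return ()

    def score(self, prev_state, word, is_last_word=False):
        # Made-up values: -1 per word, and -0.5 more at the sentence end.
        penalty = 0.5 if is_last_word else 0.0
        return -1.0 - penalty, prev_state + (word,)


print(score_sentence(ToyScorer(), ["Hello", "world"]))  # -2.5
```

With a real `KenlmScorer`, the same loop would apply, and the resulting numbers would depend on the loaded model and on `alpha` and `beta`.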