speechbrain.integrations.decoders.kenlm_scorer module

Language model wrapper for KenLM n-gram models.

This file is based on the KenLM wrapper in PyCTCDecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used in CTC decoders.

See: speechbrain.decoders.ctc

Authors
  • Adel Moumen 2023

  • Peter Plantinga 2024

Summary

Classes:

KenlmScorer

KenLM language model container class to consolidate functionality.

KenlmState

Wrapper for kenlm state.

Functions:

LanguageModel

This function redirects users to the correct class name, printing a deprecation notice.

load_unigram_set_from_arpa

Read unigrams from arpa file.

Reference

speechbrain.integrations.decoders.kenlm_scorer.LanguageModel(*args, **kwargs)

This function redirects users to the correct class name, printing a deprecation notice.

This can be removed once deprecation is complete.

speechbrain.integrations.decoders.kenlm_scorer.load_unigram_set_from_arpa(arpa_path: str) → Set[str]

Read unigrams from arpa file.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

arpa_path (str) – Path to arpa file.

Returns:

unigrams – Set of unigrams.

Return type:

set

Example

>>> arpa_file = getfixture("tmpdir").join("bigram.arpa")
>>> arpa_file.write(
...     "Anything can be here\n"
...     + "\n"
...     + "\\data\\\n"
...     + "ngram 1=3\n"
...     + "ngram 2=4\n"
...     + "\n"
...     + "\\1-grams:\n"
...     + "0 <s>\n"
...     + "-0.6931 a 0.\n"
...     + "-0.6931 b 0.\n"
...     + "\n"  # Ends unigram section
...     + "\\2-grams:\n"
...     + "-0.6931 <s> a\n"
...     + "-0.6931 a a\n"
...     + "-0.6931 a b\n"
...     + "-0.6931 b a\n"
...     + "\n"  # Ends bigram section
...     + "\\end\\\n"
... )  # Ends whole file
>>> sorted(load_unigram_set_from_arpa(arpa_file))
['a', 'b']
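
The parsing step behind this example can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the helper name `read_arpa_unigrams` is hypothetical, and it assumes that a unigram entry has three whitespace-separated fields (log probability, word, backoff weight), which is consistent with `<s>` being absent from the output above.

```python
def read_arpa_unigrams(lines):
    """Collect the words listed in the 1-grams section of ARPA-formatted lines."""
    unigrams = set()
    in_unigrams = False
    for raw in lines:
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        # Any later section header (e.g. \2-grams: or \end\) ends the unigrams.
        if in_unigrams and line.startswith("\\"):
            break
        if in_unigrams and line:
            parts = line.split()
            # Assumption: only complete "logprob word backoff" entries count.
            if len(parts) == 3:
                unigrams.add(parts[1])
    return unigrams


arpa_text = (
    "\\data\\\nngram 1=3\n\n"
    "\\1-grams:\n0 <s>\n-0.6931 a 0.\n-0.6931 b 0.\n\n"
    "\\2-grams:\n-0.6931 a b\n\n\\end\\\n"
)
print(sorted(read_arpa_unigrams(arpa_text.splitlines())))  # ['a', 'b']
```
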

class speechbrain.integrations.decoders.kenlm_scorer.KenlmState(state: State)

Bases: object

Wrapper for kenlm state.

This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

state (kenlm.State) – Kenlm state object.

property state: State

Get the raw state object.
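
The read-only wrapper pattern described here can be illustrated without KenLM installed. The class below is a hypothetical stand-in, not the actual `KenlmState` code: it exposes the wrapped object through a property with no setter, so callers cannot rebind it.

```python
class ReadOnlyState:
    """Hypothetical stand-in showing a getter-only wrapper around a state."""

    def __init__(self, state):
        self._state = state

    @property
    def state(self):
        """Return the wrapped object; no setter is defined."""
        return self._state


wrapped = ReadOnlyState(state={"placeholder": True})
try:
    wrapped.state = None  # Rebinding through the property is rejected.
    reassigned = True
except AttributeError:
    reassigned = False
print(reassigned)  # False
```

Note that a getter-only property prevents rebinding but does not freeze the wrapped object itself.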

class speechbrain.integrations.decoders.kenlm_scorer.KenlmScorer(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)

Bases: object

KenLM language model container class to consolidate functionality.

This class is a wrapper around the KenLM language model. It provides functionality to score tokens and to get the initial state.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:
  • kenlm_model (kenlm.Model) – Kenlm model.

  • unigrams (Collection[str], optional) – Collection of known word unigrams.

  • alpha (float) – Weight for language model during shallow fusion.

  • beta (float) – Weight for length score adjustment during scoring.

  • unk_score_offset (float) – Amount of log score offset for unknown tokens.

  • score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.

Example

>>> arpa_file = getfixture("tmpdir").join("bigram_hello.arpa")
>>> arpa_file.write(
...     "\\data\\\n"
...     + "ngram 1=4\n"
...     + "ngram 2=1\n\n"
...     + "\\1-grams:\n"
...     + "-1.0\t<s>\t-1.0\n"
...     + "-1.0\t</s>\t-1.0\n"
...     + "-1.0\tHello\t-0.23\n"
...     + "-0.7\tworld\t-0.25\n\n"
...     + "\\2-grams:\n"
...     + "-0.3\tHello world\n\n"
...     + "\\end\\"
... )
>>> model = kenlm.Model(str(arpa_file))
>>> scorer = KenlmScorer(kenlm_model=model, unigrams=["Hello", "world"])
>>> state = scorer.get_start_state()
>>> score, new_state = scorer.score(state, "Hello")
>>> round(score, 3)
-0.803
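
The value -0.803 can be reproduced by hand. The sketch below assumes the scoring rule used in pyctcdecode, on which this class is based: the KenLM log10 score is converted to natural log, scaled by alpha, and offset by beta. It also assumes KenLM backs off to the unigram because the ARPA file has no `<s> Hello` bigram, and that no unknown-word offset applies because "Hello" is in `unigrams`. Variable names are illustrative.

```python
import math

alpha, beta = 0.5, 1.5  # Default values from the KenlmScorer signature

# No "<s> Hello" bigram exists, so back off:
# backoff weight of <s> (-1.0) plus unigram log10 probability of Hello (-1.0).
log10_score = -1.0 + -1.0

# Convert log10 to natural log, then apply the LM weight and the length bonus.
ln_score = log10_score * math.log(10)
combined = alpha * ln_score + beta
print(round(combined, 3))  # -0.803
```
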

property order: int

Get the order of the n-gram language model.

get_start_state() → KenlmState

Get the initial LM state.

score_partial_token(partial_token: str) → float

Get partial token score.
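
One plausible reading of partial-token scoring, sketched under assumptions: an unfinished word incurs no penalty while it is still a prefix of some known unigram, and receives `unk_score_offset` otherwise. The function name `partial_token_penalty` is hypothetical, and the actual method may differ (for example, by using a character trie or scaling the penalty by token length).

```python
def partial_token_penalty(partial_token, unigrams, unk_score_offset=-10.0):
    """Return 0.0 if any known word starts with partial_token, else the offset."""
    if any(word.startswith(partial_token) for word in unigrams):
        return 0.0
    return unk_score_offset


vocab = {"Hello", "world"}
print(partial_token_penalty("Hel", vocab))  # 0.0
print(partial_token_penalty("xyz", vocab))  # -10.0
```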

score(prev_state, word: str, is_last_word: bool = False) → Tuple[float, KenlmState]

Score a word conditional on the previous state, returning the score and the updated state.
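
Because `score` returns the next state alongside the score, scoring a sequence means threading that state from one call to the next. The sketch below shows the calling pattern only: `score_sentence` and `StubScorer` are hypothetical names, and the stub is a toy stand-in so the example runs without KenLM. Any object exposing `get_start_state` and `score` with the signatures above could be passed instead.

```python
def score_sentence(scorer, words):
    """Sum per-word scores, passing each returned state into the next call."""
    state = scorer.get_start_state()
    total = 0.0
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        step, state = scorer.score(state, word, is_last_word=is_last)
        total += step
    return total


class StubScorer:
    """Hypothetical stand-in with the same two methods; not a real LM."""

    def get_start_state(self):
        return ()

    def score(self, prev_state, word, is_last_word=False):
        # Toy rule: fixed penalty per word; the state records the history.
        return -1.0, prev_state + (word,)


print(score_sentence(StubScorer(), ["Hello", "world"]))  # -2.0
```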