speechbrain.decoders.language_model module

Language model wrapper for kenlm n-gram.

This file is based on the KenLM wrapper implementation from pyctcdecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used by the CTC decoders.

See: speechbrain.decoders.ctc

Authors
  • Adel Moumen 2023

Summary

Classes:

  • KenlmState – Wrapper for kenlm state.

  • LanguageModel – Language model container class to consolidate functionality.

Functions:

  • load_unigram_set_from_arpa – Read unigrams from arpa file.

Reference

speechbrain.decoders.language_model.load_unigram_set_from_arpa(arpa_path: str) → Set[str][source]

Read unigrams from arpa file.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

arpa_path (str) – Path to arpa file.

Returns:

unigrams – Set of unigrams.

Return type:

Set[str]
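A minimal usage sketch for this function; the ARPA path below is a placeholder and must point to an existing n-gram LM in ARPA format.

  from speechbrain.decoders.language_model import load_unigram_set_from_arpa

  # "path/to/lm.arpa" is a placeholder path to a KenLM ARPA file.
  unigrams = load_unigram_set_from_arpa("path/to/lm.arpa")

  print(len(unigrams))      # number of distinct unigrams in the LM vocabulary
  print("the" in unigrams)  # the result is a plain set, so membership tests are cheap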

class speechbrain.decoders.language_model.KenlmState(state: State)[source]

Bases: object

Wrapper for kenlm state.

This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

state (kenlm.State) – Kenlm state object.

property state: State

Get the raw state object.
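A brief sketch of wrapping a raw kenlm state; it assumes the kenlm package is installed and that "path/to/lm.arpa" is a placeholder model path.

  import kenlm
  from speechbrain.decoders.language_model import KenlmState

  kenlm_model = kenlm.Model("path/to/lm.arpa")  # placeholder path

  # Create a raw kenlm state and initialise it with sentence-start context.
  raw_state = kenlm.State()
  kenlm_model.BeginSentenceWrite(raw_state)

  # Wrap it; the raw object is only reachable through the read-only property.
  wrapped = KenlmState(raw_state)
  assert wrapped.state is raw_state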

class speechbrain.decoders.language_model.LanguageModel(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)[source]

Bases: object

Language model container class to consolidate functionality.

This class is a wrapper around the kenlm language model. It provides functionality to score tokens and to get the initial state.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:
  • kenlm_model (kenlm.Model) – Kenlm model.

  • unigrams (Collection[str], optional) – Collection of known word unigrams.

  • alpha (float) – Weight for language model during shallow fusion.

  • beta (float) – Weight for length score adjustment during scoring.

  • unk_score_offset (float) – Amount of log score offset for unknown tokens.

  • score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.
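A hedged construction sketch tying the parameters above together; the ARPA path is a placeholder and the keyword values shown are simply the documented defaults.

  import kenlm

  from speechbrain.decoders.language_model import (
      LanguageModel,
      load_unigram_set_from_arpa,
  )

  arpa_path = "path/to/lm.arpa"  # placeholder path to a KenLM ARPA file
  unigrams = load_unigram_set_from_arpa(arpa_path)

  lm = LanguageModel(
      kenlm_model=kenlm.Model(arpa_path),
      unigrams=unigrams,
      alpha=0.5,               # LM weight during shallow fusion
      beta=1.5,                # length score adjustment
      unk_score_offset=-10.0,  # log-score penalty for unknown tokens
      score_boundary=True,     # respect sentence boundaries when scoring
  )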

property order: int

Get the order of the n-gram language model.

get_start_state() → KenlmState[source]

Get initial lm state.

score_partial_token(partial_token: str) → float[source]

Get partial token score.

score(prev_state, word: str, is_last_word: bool = False) → Tuple[float, KenlmState][source]

Score a word conditioned on the previous state.
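A hedged sketch of the scoring loop as a CTC decoder might drive it, continuing from the construction sketch above; the words and partial token are arbitrary examples.

  print(lm.order)  # n-gram order of the underlying kenlm model

  state = lm.get_start_state()  # KenlmState positioned at sentence start

  total = 0.0
  words = ["speech", "brain"]
  for i, word in enumerate(words):
      is_last = i == len(words) - 1
      word_score, state = lm.score(state, word, is_last_word=is_last)
      total += word_score

  # Heuristic score for an unfinished token during beam search.
  partial_score = lm.score_partial_token("spe")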