speechbrain.decoders.language_model module

Language model wrapper for kenlm n-gram.

This file is based on the KenLM wrapper implementation from pyctcdecode (see: https://github.com/kensho-technologies/pyctcdecode) and is used by the CTC decoders.

See: speechbrain.decoders.ctc

Authors
  • Adel Moumen 2023

Summary

Classes:

  • KenlmState – Wrapper for kenlm state.

  • LanguageModel – Language model container class to consolidate functionality.

Functions:

  • load_unigram_set_from_arpa – Read unigrams from arpa file.

Reference

speechbrain.decoders.language_model.load_unigram_set_from_arpa(arpa_path: str) → Set[str][source]

Read unigrams from arpa file.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

arpa_path (str) – Path to arpa file.

Returns:

unigrams – Set of unigrams.

Return type:

Set[str]
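A minimal usage sketch for this function; the ARPA path below is a placeholder and must point to an existing n-gram LM in ARPA format.

  from speechbrain.decoders.language_model import load_unigram_set_from_arpa

  # "path/to/lm.arpa" is a placeholder path to a KenLM ARPA file.
  unigrams = load_unigram_set_from_arpa("path/to/lm.arpa")

  print(len(unigrams))      # number of distinct unigrams in the LM vocabulary
  print("the" in unigrams)  # the result is a plain set, so membership tests are cheap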

class speechbrain.decoders.language_model.KenlmState(state: State)[source]

Bases: object

Wrapper for kenlm state.

This is a wrapper for the kenlm state object. It is used to make sure that the state is not modified outside of the language model class.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:

state (kenlm.State) – Kenlm state object.

property state: State

Get the raw state object.
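A brief sketch of wrapping a raw kenlm state; it assumes the kenlm package is installed and that "path/to/lm.arpa" is a placeholder model path.

  import kenlm
  from speechbrain.decoders.language_model import KenlmState

  kenlm_model = kenlm.Model("path/to/lm.arpa")  # placeholder path

  # Create a raw kenlm state and initialise it with sentence-start context.
  raw_state = kenlm.State()
  kenlm_model.BeginSentenceWrite(raw_state)

  # Wrap it; the raw object is only reachable through the read-only property.
  wrapped = KenlmState(raw_state)
  assert wrapped.state is raw_state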

class speechbrain.decoders.language_model.LanguageModel(kenlm_model: Model, unigrams: Collection[str] | None = None, alpha: float = 0.5, beta: float = 1.5, unk_score_offset: float = -10.0, score_boundary: bool = True)[source]

Bases: object

Language model container class to consolidate functionality.

This class is a wrapper around the kenlm language model. It provides functionality to score tokens and to get the initial state.

Taken from: https://github.com/kensho-technologies/pyctcdecode

Parameters:
  • kenlm_model (kenlm.Model) – Kenlm model.

  • unigrams (Collection[str], optional) – Collection of known word unigrams.

  • alpha (float) – Weight for language model during shallow fusion.

  • beta (float) – Weight for length score adjustment during scoring.

  • unk_score_offset (float) – Amount of log score offset for unknown tokens.

  • score_boundary (bool) – Whether to have kenlm respect boundaries when scoring.
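A hedged construction sketch tying the parameters above together; the ARPA path is a placeholder and the keyword values shown are simply the documented defaults.

  import kenlm

  from speechbrain.decoders.language_model import (
      LanguageModel,
      load_unigram_set_from_arpa,
  )

  arpa_path = "path/to/lm.arpa"  # placeholder path to a KenLM ARPA file
  unigrams = load_unigram_set_from_arpa(arpa_path)

  lm = LanguageModel(
      kenlm_model=kenlm.Model(arpa_path),
      unigrams=unigrams,
      alpha=0.5,               # LM weight during shallow fusion
      beta=1.5,                # length score adjustment
      unk_score_offset=-10.0,  # log-score penalty for unknown tokens
      score_boundary=True,     # respect sentence boundaries when scoring
  )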

property order: int

Get the order of the n-gram language model.

get_start_state() → KenlmState[source]

Get initial lm state.

score_partial_token(partial_token: str) → float[source]

Get partial token score.

score(prev_state, word: str, is_last_word: bool = False) → Tuple[float, KenlmState][source]

Score a word conditioned on the previous state.
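A hedged sketch of the scoring loop as a CTC decoder might drive it, continuing from the construction sketch above; the words and partial token are arbitrary examples.

  print(lm.order)  # n-gram order of the underlying kenlm model

  state = lm.get_start_state()  # KenlmState positioned at sentence start

  total = 0.0
  words = ["speech", "brain"]
  for i, word in enumerate(words):
      is_last = i == len(words) - 1
      word_score, state = lm.score(state, word, is_last_word=is_last)
      total += word_score

  # Heuristic score for an unfinished token during beam search.
  partial_score = lm.score_partial_token("spe")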