speechbrain.k2_integration.prepare_lang module

This module contains functions to prepare the lexicon and the language model for k2 training. It is based on the script prepare_lang.sh from k2/icefall (work of Fangjun Kuang). The original script is under the Apache 2.0 license. It has been modified to work with SpeechBrain.

Modified by:
  • Pierre Champion 2023

  • Zeyu Zhao 2023

  • Georgios Karakasidis 2023

Summary

Functions:

add_disambig_symbols

It adds pseudo-token disambiguation symbols #1, #2 and so on at the ends of tokens to ensure that all pronunciations are different, and that none is a prefix of another.

add_self_loops

Adds self-loops to states of an FST to propagate disambiguation symbols through it.

generate_id_map

Generate ID maps, i.e., map a symbol to a unique ID.

get_tokens

Get tokens from a lexicon.

get_words

Get words from a lexicon.

lexicon_to_fst

Convert a lexicon to an FST (in k2 format) with optional silence at the beginning and end of each word.

lexicon_to_fst_no_sil

Convert a lexicon to an FST (in k2 format).

prepare_lang

This function takes as input a lexicon file "$lang_dir/lexicon.txt" consisting of words and tokens (i.e., phones) and generates the token table, the word table, and the lexicon FSTs (L.pt and L_disambig.pt).

write_mapping

Write a symbol to ID mapping to a file.

Reference

speechbrain.k2_integration.prepare_lang.write_mapping(filename: str | Path, sym2id: Dict[str, int]) → None

Write a symbol to ID mapping to a file.

NOTE: No need to implement read_mapping, as it can be done through k2.SymbolTable.from_file().

Parameters:
  • filename (str or Path) – Filename to save the mapping to.

  • sym2id (Dict[str, int]) – A dict mapping symbols to IDs.
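
A minimal usage sketch pairing this with the note above (the assumption here is that the file holds one "symbol id" pair per line, which is what k2.SymbolTable.from_file() parses):

>>> from speechbrain.k2_integration.prepare_lang import write_mapping
>>> import k2
>>> write_mapping("tokens.txt", {"<eps>": 0, "a": 1, "b": 2})
>>> # Read the mapping back through k2, as suggested above.
>>> k2.SymbolTable.from_file("tokens.txt")["a"]
1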

speechbrain.k2_integration.prepare_lang.get_tokens(lexicon: List[Tuple[str, List[str]]], sil_token='SIL', manually_add_sil_to_tokens=False) → List[str]

Get tokens from a lexicon.

Parameters:
  • lexicon (Lexicon) – It is the return value of read_lexicon().

  • sil_token (str) – The optional silence token between words. It should not appear in the lexicon, otherwise it will cause an error.

  • manually_add_sil_to_tokens (bool) – If True, add sil_token to the tokens. This is useful when the lexicon does not contain sil_token but it is needed in the tokens.

Returns:

sorted_ans – A list of unique tokens.

Return type:

List[str]

speechbrain.k2_integration.prepare_lang.get_words(lexicon: List[Tuple[str, List[str]]]) → List[str]

Get words from a lexicon.

Parameters:

lexicon (Lexicon) – It is the return value of read_lexicon().

Returns:

sorted_ans – A list of unique words.

Return type:

List[str]
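
For illustration, both get_tokens and get_words applied to a toy in-memory lexicon, i.e. a list of (word, token-list) pairs matching the annotated argument type (the outputs assume plain sorted de-duplication, as the sorted_ans name suggests):

>>> from speechbrain.k2_integration.prepare_lang import get_tokens, get_words
>>> lexicon = [("hello", ["h", "e", "l", "l", "o"]),
...            ("world", ["w", "o", "r", "l", "d"])]
>>> get_tokens(lexicon)
['d', 'e', 'h', 'l', 'o', 'r', 'w']
>>> get_words(lexicon)
['hello', 'world']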

speechbrain.k2_integration.prepare_lang.add_disambig_symbols(lexicon: List[Tuple[str, List[str]]]) → Tuple[List[Tuple[str, List[str]]], int]

It adds pseudo-token disambiguation symbols #1, #2 and so on at the ends of tokens to ensure that all pronunciations are different, and that none is a prefix of another.

See also add_lex_disambig.pl from Kaldi.

Parameters:

lexicon (Lexicon) – It is returned by read_lexicon().

Returns:

  • ans – The output lexicon with disambiguation symbols.

  • max_disambig – The ID of the max disambiguation symbol that appears in the lexicon.
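
A worked sketch of the prefix case (the exact output is illustrative, following the Kaldi convention that the entry whose pronunciation is a prefix of another receives #1):

>>> from speechbrain.k2_integration.prepare_lang import add_disambig_symbols
>>> lexicon = [("bat", ["b", "a", "t"]), ("bath", ["b", "a", "t", "h"])]
>>> ans, max_disambig = add_disambig_symbols(lexicon)
>>> ans  # "b a t" is a prefix of "b a t h", so "bat" is disambiguated
[('bat', ['b', 'a', 't', '#1']), ('bath', ['b', 'a', 't', 'h'])]
>>> max_disambig
1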

speechbrain.k2_integration.prepare_lang.generate_id_map(symbols: List[str]) → Dict[str, int]

Generate ID maps, i.e., map a symbol to a unique ID.

Parameters:

symbols (List[str]) – A list of unique symbols.

Returns:

A dict mapping each symbol to a unique ID.

Return type:

Dict[str, int]
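
Behaviorally this amounts to enumerating the symbols in order (a sketch; the starting ID of 0, conventionally reserved for <eps>, is an assumption):

>>> from speechbrain.k2_integration.prepare_lang import generate_id_map
>>> generate_id_map(["<eps>", "a", "b"])
{'<eps>': 0, 'a': 1, 'b': 2}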

speechbrain.k2_integration.prepare_lang.add_self_loops(arcs: List[List[Any]], disambig_token: int, disambig_word: int) → List[List[Any]]

Adds self-loops to states of an FST to propagate disambiguation symbols through it. They are added on each state with non-epsilon output symbols on at least one arc out of the state.

See also fstaddselfloops.pl from Kaldi. One difference is that Kaldi uses OpenFst style FSTs and it has multiple final states. This function uses k2 style FSTs and it does not need to add self-loops to the final state.

The input label of a self-loop is disambig_token, while the output label is disambig_word.

Parameters:
  • arcs (List[List[Any]]) – A list of arcs; each sublist contains [src_state, dest_state, label, aux_label, score].

  • disambig_token (int) – It is the token ID of the symbol #0.

  • disambig_word (int) – It is the word ID of the symbol #0.

Returns:

The new arcs, including the added self-loops.

Return type:

List[List[Any]]
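
A small sketch under the semantics described above; the IDs are hypothetical, and state 0 has a non-epsilon output symbol on its outgoing arc, so it should receive a self-loop labeled with disambig_token and disambig_word:

>>> from speechbrain.k2_integration.prepare_lang import add_self_loops
>>> # A single arc 0 -> 1 with token ID 1 as input and word ID 1 as output.
>>> arcs = [[0, 1, 1, 1, 0.0]]
>>> # Assume "#0" has token ID 2 and word ID 2 in their respective tables.
>>> new_arcs = add_self_loops(arcs, disambig_token=2, disambig_word=2)
>>> # A self-loop such as [0, 0, 2, 2, 0] is expected to be appended.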

speechbrain.k2_integration.prepare_lang.lexicon_to_fst(lexicon: List[Tuple[str, List[str]]], token2id: Dict[str, int], word2id: Dict[str, int], sil_token: str = 'SIL', sil_prob: float = 0.5, need_self_loops: bool = False) → Fsa

Convert a lexicon to an FST (in k2 format) with optional silence at the beginning and end of each word.

Parameters:
  • lexicon (Lexicon) – The input lexicon. See also read_lexicon().

  • token2id (Dict[str, int]) – A dict mapping tokens to IDs.

  • word2id (Dict[str, int]) – A dict mapping words to IDs.

  • sil_token (str) – The silence token.

  • sil_prob (float) – The probability of adding a silence at the beginning and end of each word.

  • need_self_loops (bool) – If True, add self-loops to states with non-epsilon output symbols on at least one arc out of the state. The input label for such a self-loop is token2id["#0"] and the output label is word2id["#0"].

Returns:

fsa – An FSA representing the given lexicon.

Return type:

k2.Fsa

speechbrain.k2_integration.prepare_lang.lexicon_to_fst_no_sil(lexicon: List[Tuple[str, List[str]]], token2id: Dict[str, int], word2id: Dict[str, int], need_self_loops: bool = False) → Fsa

Convert a lexicon to an FST (in k2 format).

Parameters:
  • lexicon (Lexicon) – The input lexicon. See also read_lexicon().

  • token2id (Dict[str, int]) – A dict mapping tokens to IDs.

  • word2id (Dict[str, int]) – A dict mapping words to IDs.

  • need_self_loops (bool) – If True, add self-loops to states with non-epsilon output symbols on at least one arc out of the state. The input label for such a self-loop is token2id["#0"] and the output label is word2id["#0"].

Returns:

fsa – An FSA representing the given lexicon.

Return type:

k2.Fsa
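
Both variants share the same setup; below is a hedged end-to-end sketch using the no-silence variant (the table construction mirrors the prepare_lang steps documented next; placing <eps> at ID 0 and appending the #0 symbols are assumptions driven by the need_self_loops requirement, and lexicon_to_fst would additionally need sil_token in token2id):

>>> from speechbrain.k2_integration.prepare_lang import (
...     add_disambig_symbols, generate_id_map, get_tokens, get_words,
...     lexicon_to_fst_no_sil)
>>> lexicon = [("hello", ["h", "e", "l", "l", "o"])]
>>> lexicon_disambig, max_disambig = add_disambig_symbols(lexicon)
>>> tokens = ["<eps>"] + get_tokens(lexicon)
>>> tokens += [f"#{i}" for i in range(max_disambig + 1)]
>>> words = ["<eps>"] + get_words(lexicon) + ["#0"]
>>> L = lexicon_to_fst_no_sil(lexicon_disambig,
...                           token2id=generate_id_map(tokens),
...                           word2id=generate_id_map(words),
...                           need_self_loops=True)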

speechbrain.k2_integration.prepare_lang.prepare_lang(lang_dir, sil_token='SIL', sil_prob=0.5, cache=True)

This function takes as input a lexicon file “$lang_dir/lexicon.txt” consisting of words and tokens (i.e., phones) and does the following:

  1. Add disambiguation symbols to the lexicon and generate lexicon_disambig.txt

  2. Generate tokens.txt, the token table mapping a token to a unique integer.

  3. Generate words.txt, the word table mapping a word to a unique integer.

  4. Generate L.pt, in k2 format. It can be loaded by:

    d = torch.load("L.pt")
    lexicon = k2.Fsa.from_dict(d)

  5. Generate L_disambig.pt, in k2 format.

Parameters:
  • lang_dir (str) – The directory where the output files are stored and from which the input file lexicon.txt is read.

  • sil_token (str) – The silence token. Default is “SIL”.

  • sil_prob (float) – The probability of adding a silence at the beginning and end of each word. Default is 0.5.

  • cache (bool) – Whether to load from / save to cached .pt files.

Example

>>> import os
>>> from speechbrain.k2_integration.prepare_lang import prepare_lang
>>> # Create a small lexicon containing only two words and write it to a file.
>>> lang_tmpdir = getfixture('tmpdir')
>>> lexicon_sample = '''hello h e l l o\nworld w o r l d'''
>>> lexicon_file = lang_tmpdir.join("lexicon.txt")
>>> lexicon_file.write(lexicon_sample)
>>> prepare_lang(lang_tmpdir)
>>> for expected_file in ["tokens.txt", "words.txt", "L.pt", "L_disambig.pt", "Linv.pt"]:
...     assert os.path.exists(os.path.join(lang_tmpdir, expected_file))
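
A typical follow-up to this example (a sketch reusing only calls already shown: torch.load and k2.Fsa.from_dict from step 4, and k2.SymbolTable.from_file from the write_mapping note):

>>> import torch
>>> import k2
>>> L = k2.Fsa.from_dict(torch.load(os.path.join(lang_tmpdir, "L.pt")))
>>> token_table = k2.SymbolTable.from_file(os.path.join(lang_tmpdir, "tokens.txt"))
>>> word_table = k2.SymbolTable.from_file(os.path.join(lang_tmpdir, "words.txt"))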