speechbrain.k2_integration.prepare_lang module
This module contains functions to prepare the lexicon and the language model for k2 training. It is based on the script prepare_lang.sh from k2/icefall (work of Fangjun Kuang). The original script is under the Apache 2.0 license. This script is modified to work with SpeechBrain.
- Modified by:
Pierre Champion 2023
Zeyu Zhao 2023
Georgios Karakasidis 2023
Summary
Functions:
- add_disambig_symbols – It adds pseudo-token disambiguation symbols #1, #2 and so on at the ends of tokens to ensure that all pronunciations are different, and that none is a prefix of another.
- add_self_loops – Adds self-loops to states of an FST to propagate disambiguation symbols through it.
- generate_id_map – Generate ID maps, i.e., map a symbol to a unique ID.
- get_tokens – Get tokens from a lexicon.
- get_words – Get words from a lexicon.
- lexicon_to_fst – Convert a lexicon to an FST (in k2 format) with optional silence at the beginning and end of each word.
- lexicon_to_fst_no_sil – Convert a lexicon to an FST (in k2 format).
- prepare_lang – This function takes as input a lexicon file "$lang_dir/lexicon.txt" consisting of words and tokens (i.e., phones) and generates lexicon_disambig.txt, tokens.txt, words.txt, L.pt, and L_disambig.pt.
- write_mapping – Write a symbol to ID mapping to a file.
Reference
- speechbrain.k2_integration.prepare_lang.write_mapping(filename: str | Path, sym2id: Dict[str, int]) → None
Write a symbol to ID mapping to a file.
- NOTE: No need to implement read_mapping as it can be done through k2.SymbolTable.from_file().
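A minimal usage sketch (the file name and symbols are illustrative); per the note above, the resulting file can be read back with k2.SymbolTable.from_file():

>>> from speechbrain.k2_integration.prepare_lang import write_mapping
>>> sym2id = {"<eps>": 0, "h": 1, "e": 2, "l": 3, "o": 4}
>>> write_mapping("tokens_example.txt", sym2id)
>>> import k2
>>> table = k2.SymbolTable.from_file("tokens_example.txt")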
- speechbrain.k2_integration.prepare_lang.get_tokens(lexicon: List[Tuple[str, List[str]]], sil_token='SIL', manually_add_sil_to_tokens=False) → List[str]
Get tokens from a lexicon.
- Parameters:
lexicon (Lexicon) – It is the return value of read_lexicon().
sil_token (str) – The optional silence token between words. It should not appear in the lexicon, otherwise it will cause an error.
manually_add_sil_to_tokens (bool) – If true, add sil_token to the tokens. This is useful when the lexicon does not contain sil_token but it is needed in the tokens.
- Returns:
sorted_ans – A list of unique tokens.
- Return type:
List[str]
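A short sketch with a hypothetical two-word lexicon; with manually_add_sil_to_tokens=True the silence token is included even though it does not appear in the lexicon:

>>> from speechbrain.k2_integration.prepare_lang import get_tokens
>>> lexicon = [("hello", ["h", "e", "l", "l", "o"]),
...            ("world", ["w", "o", "r", "l", "d"])]
>>> tokens = get_tokens(lexicon, sil_token="SIL", manually_add_sil_to_tokens=True)
>>> # tokens is a sorted list of unique tokens,
>>> # e.g. ['SIL', 'd', 'e', 'h', 'l', 'o', 'r', 'w']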
- speechbrain.k2_integration.prepare_lang.get_words(lexicon: List[Tuple[str, List[str]]]) → List[str]
Get words from a lexicon.
- Parameters:
lexicon (Lexicon) – It is the return value of read_lexicon().
- Returns:
sorted_ans – A list of unique words.
- Return type:
List[str]
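The same hypothetical lexicon as above yields its unique words:

>>> from speechbrain.k2_integration.prepare_lang import get_words
>>> lexicon = [("hello", ["h", "e", "l", "l", "o"]),
...            ("world", ["w", "o", "r", "l", "d"])]
>>> words = get_words(lexicon)  # e.g. ['hello', 'world']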
- speechbrain.k2_integration.prepare_lang.add_disambig_symbols(lexicon: List[Tuple[str, List[str]]]) → Tuple[List[Tuple[str, List[str]]], int]
It adds pseudo-token disambiguation symbols #1, #2 and so on at the ends of tokens to ensure that all pronunciations are different, and that none is a prefix of another.
See also add_lex_disambig.pl from Kaldi.
- Parameters:
lexicon (Lexicon) – It is returned by read_lexicon().
- Returns:
ans – The output lexicon with disambiguation symbols.
max_disambig – The ID of the max disambiguation symbol that appears in the lexicon.
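A sketch with a hypothetical lexicon exhibiting both problems the disambiguation symbols solve: one pronunciation is a prefix of another, and two words share a pronunciation:

>>> from speechbrain.k2_integration.prepare_lang import add_disambig_symbols
>>> lexicon = [("a", ["ah"]),                # prefix of the pronunciation of "able"
...            ("able", ["ah", "b", "l"]),
...            ("hi", ["h", "ay"]),          # identical to the pronunciation of "hey"
...            ("hey", ["h", "ay"])]
>>> disambig_lexicon, max_disambig = add_disambig_symbols(lexicon)
>>> # Ambiguous pronunciations now end in distinct #N symbols,
>>> # and max_disambig is the largest N that was used.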
- speechbrain.k2_integration.prepare_lang.generate_id_map(symbols: List[str]) → Dict[str, int]
Generate ID maps, i.e., map a symbol to a unique ID.
- Parameters:
symbols (List[str]) – A list of unique symbols.
- Returns:
A dict containing the mapping between symbols and IDs.
- Return type:
Dict[str, int]
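A minimal sketch (assuming IDs are assigned consecutively in list order, starting from 0):

>>> from speechbrain.k2_integration.prepare_lang import generate_id_map
>>> id_map = generate_id_map(["<eps>", "a", "b"])
>>> # e.g. {'<eps>': 0, 'a': 1, 'b': 2}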
- speechbrain.k2_integration.prepare_lang.add_self_loops(arcs: List[List[Any]], disambig_token: int, disambig_word: int) → List[List[Any]]
Adds self-loops to states of an FST to propagate disambiguation symbols through it. They are added on each state with non-epsilon output symbols on at least one arc out of the state.
See also fstaddselfloops.pl from Kaldi. One difference is that Kaldi uses OpenFst style FSTs and it has multiple final states. This function uses k2 style FSTs and it does not need to add self-loops to the final state.
The input label of a self-loop is disambig_token, while the output label is disambig_word.
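A hedged sketch; the arc layout [src_state, dst_state, input_label, output_label, score] and the label IDs below are assumptions for illustration:

>>> from speechbrain.k2_integration.prepare_lang import add_self_loops
>>> arcs = [[0, 1, 3, 5, 0.0],   # non-epsilon output label: state 0 gets a self-loop
...         [1, 2, 3, 0, 0.0]]   # epsilon (0) output label
>>> arcs = add_self_loops(arcs, disambig_token=7, disambig_word=4)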
- speechbrain.k2_integration.prepare_lang.lexicon_to_fst(lexicon: List[Tuple[str, List[str]]], token2id: Dict[str, int], word2id: Dict[str, int], sil_token: str = 'SIL', sil_prob: float = 0.5, need_self_loops: bool = False) → k2.Fsa
Convert a lexicon to an FST (in k2 format) with optional silence at the beginning and end of each word.
- Parameters:
lexicon (Lexicon) – The input lexicon. See also read_lexicon().
token2id (Dict[str, int]) – A dict mapping tokens to IDs.
word2id (Dict[str, int]) – A dict mapping words to IDs.
sil_token (str) – The silence token.
sil_prob (float) – The probability for adding a silence at the beginning and end of the word.
need_self_loops (bool) – If True, add self-loop to states with non-epsilon output symbols on at least one arc out of the state. The input label for this self loop is token2id["#0"] and the output label is word2id["#0"].
- Returns:
fsa – An FSA representing the given lexicon.
- Return type:
k2.Fsa
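A sketch that chains the helpers above (reserving ID 0 for "<eps>" is an assumption here; prepare_lang() below performs the full bookkeeping, including the #0 symbols needed when need_self_loops=True):

>>> from speechbrain.k2_integration.prepare_lang import (
...     generate_id_map, get_tokens, get_words, lexicon_to_fst)
>>> lexicon = [("hello", ["h", "e", "l", "l", "o"])]
>>> token2id = generate_id_map(
...     ["<eps>"] + get_tokens(lexicon, manually_add_sil_to_tokens=True))
>>> word2id = generate_id_map(["<eps>"] + get_words(lexicon))
>>> L = lexicon_to_fst(lexicon, token2id=token2id, word2id=word2id,
...                    sil_token="SIL", sil_prob=0.5)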
- speechbrain.k2_integration.prepare_lang.lexicon_to_fst_no_sil(lexicon: List[Tuple[str, List[str]]], token2id: Dict[str, int], word2id: Dict[str, int], need_self_loops: bool = False) → k2.Fsa
Convert a lexicon to an FST (in k2 format).
- Parameters:
lexicon (Lexicon) – The input lexicon. See also read_lexicon().
token2id (Dict[str, int]) – A dict mapping tokens to IDs.
word2id (Dict[str, int]) – A dict mapping words to IDs.
need_self_loops (bool) – If True, add self-loop to states with non-epsilon output symbols on at least one arc out of the state. The input label for this self loop is token2id["#0"] and the output label is word2id["#0"].
- Returns:
fsa – An FSA representing the given lexicon.
- Return type:
k2.Fsa
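Usage mirrors lexicon_to_fst() without the silence options; reusing the hypothetical lexicon and ID maps from the previous sketch:

>>> from speechbrain.k2_integration.prepare_lang import lexicon_to_fst_no_sil
>>> L_no_sil = lexicon_to_fst_no_sil(lexicon, token2id=token2id, word2id=word2id)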
- speechbrain.k2_integration.prepare_lang.prepare_lang(lang_dir, sil_token='SIL', sil_prob=0.5, cache=True)
This function takes as input a lexicon file "$lang_dir/lexicon.txt" consisting of words and tokens (i.e., phones) and does the following:
1. Add disambiguation symbols to the lexicon and generate lexicon_disambig.txt.
2. Generate tokens.txt, the token table mapping a token to a unique integer.
3. Generate words.txt, the word table mapping a word to a unique integer.
4. Generate L.pt, in k2 format. It can be loaded by:
   d = torch.load("L.pt")
   lexicon = k2.Fsa.from_dict(d)
5. Generate L_disambig.pt, in k2 format.
- Parameters:
lang_dir (str) – The directory to store the output files and read the input file lexicon.txt.
sil_token (str) – The silence token. Default is "SIL".
sil_prob (float) – The probability for adding a silence at the beginning and end of the word. Default is 0.5.
cache (bool) – Whether or not to load/cache from/to the .pt format.
- Return type:
None
Example
>>> from speechbrain.k2_integration.prepare_lang import prepare_lang
>>> import os
>>> # Create a small lexicon containing only two words and write it to a file.
>>> lang_tmpdir = getfixture('tmpdir')
>>> lexicon_sample = '''hello h e l l o\nworld w o r l d'''
>>> lexicon_file = lang_tmpdir.join("lexicon.txt")
>>> lexicon_file.write(lexicon_sample)
>>> prepare_lang(lang_tmpdir)
>>> for expected_file in ["tokens.txt", "words.txt", "L.pt", "L_disambig.pt", "Linv.pt"]:
...     assert os.path.exists(os.path.join(lang_tmpdir, expected_file))