speechbrain.k2_integration.lexicon module

Lexicon class and utilities. Provides functions to read/write lexicon files and convert them to k2 ragged tensors. The Lexicon class provides a way to convert a list of words to a ragged tensor containing token IDs. It also stores the lexicon graph which can be used by a graph compiler to decode sequences.

This code was adapted from, and is therefore heavily inspired by, icefall's (https://github.com/k2-fsa/icefall) Lexicon class and its utility functions.

Authors:
  • Pierre Champion 2023

  • Zeyu Zhao 2023

  • Georgios Karakasidis 2023

Summary

Classes:

  • Lexicon – Unit based lexicon.

Functions:

  • prepare_char_lexicon – Read extra_csv_files to generate a $lang_dir/lexicon.txt for k2 training.

  • read_lexicon – Read a lexicon from filename.

  • write_lexicon – Write a lexicon to a file.

Reference

class speechbrain.k2_integration.lexicon.Lexicon(lang_dir: Path)[source]

Bases: object

Unit based lexicon. It is used to map a list of words to each word’s sequence of tokens (characters). It also stores the lexicon graph which can be used by a graph compiler to decode sequences.

Parameters:

lang_dir (str) –

Path to the lang directory. It is expected to contain the following files:

  • tokens.txt

  • words.txt

  • L.pt

Example

>>> from speechbrain.k2_integration import k2
>>> from speechbrain.k2_integration.lexicon import Lexicon
>>> from speechbrain.k2_integration.graph_compiler import CtcGraphCompiler
>>> from speechbrain.k2_integration.prepare_lang import prepare_lang
>>> # Create a small lexicon containing only two words and write it to a file.
>>> lang_tmpdir = getfixture('tmpdir')
>>> lexicon_sample = '''hello h e l l o\nworld w o r l d'''
>>> lexicon_file = lang_tmpdir.join("lexicon.txt")
>>> lexicon_file.write(lexicon_sample)
>>> # Create a lang directory with the lexicon and L.pt, L_inv.pt, L_disambig.pt
>>> prepare_lang(lang_tmpdir)
>>> # Create a lexicon object
>>> lexicon = Lexicon(lang_tmpdir)
>>> # Make sure the lexicon was loaded correctly
>>> assert isinstance(lexicon.token_table, k2.SymbolTable)
>>> assert isinstance(lexicon.L, k2.Fsa)
property tokens: List[int]

Return a list of token IDs excluding those from disambiguation symbols and epsilon.
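
A minimal, hedged check (an addition, not part of the original docs), reusing the lexicon built in the class example above and assuming the usual k2 convention that epsilon has ID 0:

>>> assert 0 not in lexicon.tokens  # epsilon is excluded
>>> assert all(isinstance(t, int) for t in lexicon.tokens)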

property L_disambig: Fsa

Return the lexicon FSA (with disambiguation symbols). Needed for HLG construction.

remove_G_rescoring_disambig_symbols(G: Fsa)[source]

Remove the disambiguation symbols of a G graph.

Parameters:

G (k2.Fsa) – The G graph to be modified

remove_LG_disambig_symbols(LG: Fsa) → Fsa[source]

Remove the disambiguation symbols of an LG graph. Needed for HLG construction.

Parameters:

LG (k2.Fsa) – The LG graph to be modified

texts_to_word_ids(texts: List[str], add_sil_token_as_separator=False, sil_token_id: int | None = None, log_unknown_warning=True) → List[List[int]][source]

Convert a list of texts into word IDs.

This method performs the mapping of each word in the input texts to its corresponding ID. The result is a list of lists, where each inner list contains the word IDs for a sentence. If the add_sil_token_as_separator flag is True, a silence token is inserted between words, and the sil_token_id parameter specifies the ID for the silence token. If a word is not found in the vocabulary, a warning is logged if log_unknown_warning is True.

Parameters:
  • texts (List[str]) – A list of strings where each string represents a sentence. Each sentence is composed of space-separated words.

  • add_sil_token_as_separator (bool) – Flag indicating whether to add a silence token as a separator between words.

  • sil_token_id (Optional[int]) – The ID of the silence token. If not provided, the separator is not added.

  • log_unknown_warning (bool) – Flag indicating whether to log a warning for unknown words.

Returns:

word_ids – A list of lists where each inner list represents the word IDs for a sentence. The word IDs are obtained based on the vocabulary mapping.

Return type:

List[List[int]]
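
A usage sketch (an addition, not part of the original docs), assuming lexicon is the object built in the class example above; the concrete IDs depend on the generated words.txt, so only the shape is checked:

>>> word_ids = lexicon.texts_to_word_ids(["hello world", "hello"])
>>> assert len(word_ids) == 2
>>> assert len(word_ids[0]) == 2 and len(word_ids[1]) == 1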

texts_to_token_ids(texts: List[str], log_unknown_warning=True) → List[List[List[int]]][source]

Convert a list of text sentences into token IDs.

Parameters:
  • texts (List[str]) –

    A list of strings, where each string represents a sentence. Each sentence consists of space-separated words. Example:

    ['hello world', 'tokenization with lexicon']

  • log_unknown_warning (bool) – Flag indicating whether to log warnings for out-of-vocabulary tokens. If True, warnings will be logged when encountering unknown tokens.

Returns:

token_ids – A list containing token IDs for each sentence in the input. The structure of the list is as follows:

[
    [  # For the first sentence
        [token_id_1, token_id_2, …, token_id_n],
        [token_id_1, token_id_2, …, token_id_m],
        …
    ],
    [  # For the second sentence
        [token_id_1, token_id_2, …, token_id_p],
        [token_id_1, token_id_2, …, token_id_q],
        …
    ],
    …
]

Each innermost list represents the token IDs for a word in the sentence.

Return type:

List[List[List[int]]]
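
A hedged usage sketch with the two-word lexicon from the class example above; exact IDs depend on the generated tokens.txt, so the checks are structural:

>>> token_ids = lexicon.texts_to_token_ids(["hello world"])
>>> assert len(token_ids) == 1        # one sentence
>>> assert len(token_ids[0]) == 2     # two words
>>> assert len(token_ids[0][0]) == 5  # "hello" maps to five character tokens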

texts_to_token_ids_with_multiple_pronunciation(texts: List[str], log_unknown_warning=True) → List[List[List[List[int]]]][source]

Convert a list of input texts to token IDs with multiple pronunciation variants.

This method converts input texts into token IDs, considering multiple pronunciation variants. The resulting structure allows for handling various pronunciations of words within the given texts.

Parameters:
  • texts (List[str]) – A list of strings, where each string represents a sentence for an utterance. Each sentence consists of space-separated words.

  • log_unknown_warning (bool) – Indicates whether to log warnings for out-of-vocabulary (OOV) tokens. If set to True, warnings will be logged for OOV tokens during the conversion.

Returns:

token_ids – A nested list structure containing token IDs for each utterance. The structure is as follows:

  • Outer list: represents the different utterances.

  • Middle list: represents the different pronunciation variants for each utterance.

  • Inner list: represents the sequence of token IDs for each pronunciation variant.

  • Innermost list: represents the token IDs for each word in the sequence.

Return type:

List[List[List[List[int]]]]
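
A hedged sketch using the same lexicon as the class example above; since each word there has a single pronunciation, exactly one variant per utterance is expected:

>>> variants = lexicon.texts_to_token_ids_with_multiple_pronunciation(["hello world"])
>>> assert len(variants) == 1        # one utterance
>>> assert len(variants[0]) >= 1     # at least one pronunciation variant
>>> assert len(variants[0][0]) == 2  # two words in the first variant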

arc_sort()[source]

Sort L, L_inv, L_disambig arcs of every state.

to(device: str = 'cpu')[source]

Move L, L_inv and L_disambig to the given device.

Parameters:

device (str) – The device
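
A minimal sketch of both maintenance calls (an addition to the original docs); return values are discarded because the reference above does not specify them:

>>> _ = lexicon.arc_sort()  # arc-sort L, L_inv and L_disambig
>>> _ = lexicon.to("cpu")   # move the FSAs to the target device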

speechbrain.k2_integration.lexicon.prepare_char_lexicon(lang_dir, vocab_files, extra_csv_files=[], column_text_key='wrd', add_word_boundary=True)[source]

Read extra_csv_files to generate a $lang_dir/lexicon.txt for k2 training. This usually includes the csv files of the training set and the dev set in the output_folder. During training, we need to make sure that the lexicon.txt contains all (or the majority of) the words in the training set and the dev set.

NOTE: This assumes that the csv files contain the transcription in the last column.

Also note that in each csv_file, the first line is the header, and the remaining lines are in the following format:

ID, duration, wav, spk_id, wrd (transcription)

We only need the transcription in this function.

Writes out $lang_dir/lexicon.txt

Note that the lexicon.txt is a text file with the following format:

word1 phone1 phone2 phone3 …
word2 phone1 phone2 phone3 …

In this code, we simply use the characters of each word as its phones. Other phone sets, e.g., phonemes or BPE units, can be used to train a better model.

Parameters:
  • lang_dir (str) – The directory to store the lexicon.txt

  • vocab_files (List[str]) – A list of extra vocab files. For example, for librispeech this could be the librispeech-vocab.txt file.

  • extra_csv_files (List[str]) – A list of csv file paths

  • column_text_key (str) – The column name of the transcription in the csv file. By default, it is “wrd”.

  • add_word_boundary (bool) – Whether to append the word boundary symbol <eow> to each word's token sequence in the lexicon.

Example

>>> from speechbrain.k2_integration.lexicon import prepare_char_lexicon
>>> # Create some dummy csv files containing only the words `hello`, `world`.
>>> # The first line is the header, and the remaining lines are in the following
>>> # format:
>>> # ID, duration, wav, spk_id, wrd (transcription)
>>> csv_file = getfixture('tmpdir').join("train.csv")
>>> # Data to be written to the CSV file.
>>> import csv
>>> data = [
...    ["ID", "duration", "wav", "spk_id", "wrd"],
...    [1, 1, 1, 1, "hello world"],
...    [2, 0.5, 1, 1, "hello"]
... ]
>>> with open(csv_file, "w", newline="") as f:
...    writer = csv.writer(f)
...    writer.writerows(data)
>>> extra_csv_files = [csv_file]
>>> lang_dir = getfixture('tmpdir')
>>> vocab_files = []
>>> prepare_char_lexicon(lang_dir, vocab_files, extra_csv_files=extra_csv_files, add_word_boundary=False)
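
A follow-up check (an addition, not part of the original example) that the generated lexicon contains the expected words; the exact file contents are not asserted since they depend on extra entries added by the implementation:

>>> lexicon_lines = lang_dir.join("lexicon.txt").read().splitlines()
>>> assert any(line.startswith("hello ") for line in lexicon_lines)
>>> assert any(line.startswith("world ") for line in lexicon_lines)
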
speechbrain.k2_integration.lexicon.read_lexicon(filename: str) → List[Tuple[str, List[str]]][source]

Read a lexicon from filename.

Each line in the lexicon contains “word p1 p2 p3 …”. That is, the first field is a word and the remaining fields are tokens. Fields are separated by space(s).

Parameters:

filename (str) – Path to the lexicon.txt

Returns:

A list of tuples, e.g., [('w', ['p1', 'p2']), ('w1', ['p3', 'p4'])]

Return type:

ans
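
A hedged usage sketch (an addition, not part of the original docs), writing a small lexicon file by hand and reading it back; the output shown assumes the documented parsing:

>>> from speechbrain.k2_integration.lexicon import read_lexicon
>>> lexicon_path = getfixture('tmpdir').join("lexicon_read.txt")
>>> lexicon_path.write("hello h e l l o\nworld w o r l d")
>>> read_lexicon(str(lexicon_path))
[('hello', ['h', 'e', 'l', 'l', 'o']), ('world', ['w', 'o', 'r', 'l', 'd'])]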

speechbrain.k2_integration.lexicon.write_lexicon(filename: str | Path, lexicon: List[Tuple[str, List[str]]]) → None[source]

Write a lexicon to a file.

Parameters:
  • filename (str or Path) – Path to the lexicon file to be generated.

  • lexicon (List[Tuple[str, List[str]]]) – It can be the return value of read_lexicon().
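
A hedged round-trip sketch (an addition to the original docs), pairing write_lexicon() with read_lexicon():

>>> from speechbrain.k2_integration.lexicon import read_lexicon, write_lexicon
>>> out_path = getfixture('tmpdir').join("lexicon_out.txt")
>>> write_lexicon(str(out_path), [("hello", ["h", "e", "l", "l", "o"])])
>>> read_lexicon(str(out_path))
[('hello', ['h', 'e', 'l', 'l', 'o'])]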