speechbrain.lobes.models.g2p.dataio module
Data pipeline elements for the G2P pipeline
- Authors
Loren Lugosch 2020
Mirco Ravanelli 2020
Artem Ploujnikov 2021 (minor refactoring only)
Summary
Classes:
- LazyInit – A lazy initialization wrapper

Functions:
- add_bos_eos – Adds BOS and EOS tokens to the sequence provided
- beam_search_pipeline – Performs a beam search on the phonemes
- build_token_char_map – Builds a map that maps arbitrary tokens to arbitrarily chosen characters
- char_map_detokenize – Returns a function that recovers the original sequence from one that has been tokenized using a character map
- char_range – Produces a list of consecutive characters
- clean_pipeline – Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase
- enable_eos_bos – Initializes the phoneme encoder with EOS/BOS sequences
- flip_map – Exchanges keys and values in a dictionary
- get_sequence_key – Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention
- grapheme_pipeline – Encodes a grapheme sequence
- lazy_init – A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)
- phoneme_decoder_pipeline – Decodes a sequence of phonemes
- phoneme_pipeline – Encodes a sequence of phonemes using the encoder provided
- phonemes_to_label – Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens
- remove_special – Removes any special tokens from the sequence
- text_decode – Decodes a sequence using a tokenizer
- tokenizer_encode_pipeline – A pipeline element that uses a pretrained tokenizer
- word_emb_pipeline – Applies word embeddings, if applicable
Reference
- speechbrain.lobes.models.g2p.dataio.clean_pipeline(txt, graphemes)[source]
Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase
- Parameters:
txt (str) – the text to clean
graphemes (list) – the list of accepted graphemes
- Returns:
item – A wrapped transformation function
- Return type:
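The cleaning step can be illustrated with a minimal pure-Python sketch. The function name and grapheme set below are hypothetical, not part of the module's API:

```python
def clean_text(txt, graphemes):
    # Convert to uppercase, then keep only accepted graphemes.
    # (To preserve word boundaries, include " " in the grapheme set.)
    return "".join(ch for ch in txt.upper() if ch in graphemes)

graphemes = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ '")
clean_text("Hello, world!", graphemes)  # "HELLO WORLD"
```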
- speechbrain.lobes.models.g2p.dataio.grapheme_pipeline(char, grapheme_encoder=None, uppercase=True)[source]
Encodes a grapheme sequence
- Parameters:
char (str) – A list of characters to encode.
grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder for graphemes. If not provided, a new one will be created.
uppercase (bool) – whether or not to convert items to uppercase
- Yields:
grapheme_list (list) – a raw list of graphemes, excluding any non-matching labels
grapheme_encoded_list (list) – a list of graphemes encoded as integers
grapheme_encoded (torch.Tensor) – a tensor of encoded graphemes
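The multi-output, generator style of such pipeline elements can be sketched in plain Python. Here a simple dict stands in for the grapheme encoder, and the function name is hypothetical:

```python
def grapheme_pipeline_sketch(char, encoder, uppercase=True):
    """Yields the raw grapheme list first, then the integer-encoded list,
    mimicking the multi-output dataio pipeline style."""
    if uppercase:
        char = char.upper()
    # Keep only graphemes known to the encoder (non-matching labels excluded).
    grapheme_list = [c for c in char if c in encoder]
    yield grapheme_list
    yield [encoder[c] for c in grapheme_list]

encoder = {c: i for i, c in enumerate("ABC")}
outputs = list(grapheme_pipeline_sketch("cab!", encoder))  # [["C", "A", "B"], [2, 0, 1]]
```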
- speechbrain.lobes.models.g2p.dataio.tokenizer_encode_pipeline(seq, tokenizer, tokens, wordwise=True, word_separator=' ', token_space_index=512, char_map=None)[source]
A pipeline element that uses a pretrained tokenizer
- Parameters:
seq (list) – List of tokens to encode.
tokenizer (speechbrain.tokenizer.SentencePiece) – a tokenizer instance
tokens (str) – available tokens
wordwise (bool) – whether tokenization is performed one word at a time or on the whole sequence. Whole-sequence tokenization can produce token sequences in which a single token spans multiple words
word_separator (str) – The substring to use as a separator between words.
token_space_index (int) – the index of the space token
char_map (dict) – a mapping from characters to tokens. This is used when tokenizing sequences of phonemes rather than sequences of characters. A sequence of phonemes is typically a list of one- or two-character tokens (e.g. ["DH", "UH", " ", "S", "AW", "N", "D"]). The character map makes it possible to map these to arbitrarily selected characters
- Yields:
token_list (list) – a list of raw tokens
encoded_list (list) – a list of tokens, encoded as a list of integers
encoded (torch.Tensor) – a list of tokens, encoded as a tensor
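The role of char_map can be shown with a small sketch: multi-character phoneme labels are collapsed into single stand-in characters so that a character-level tokenizer such as SentencePiece can consume them. The mapping and helper name below are hypothetical:

```python
# Hypothetical char map: each phoneme label gets one stand-in character.
char_map = {"DH": "A", "UH": "B", "S": "C", "AW": "D", "N": "E", "D": "F", " ": " "}

def phonemes_to_charstring(phonemes, char_map):
    # Collapse the phoneme sequence into a plain string suitable
    # for a character-level tokenizer.
    return "".join(char_map[p] for p in phonemes)

phonemes_to_charstring(["DH", "UH", " ", "S", "AW", "N", "D"], char_map)  # "AB CDEF"
```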
- speechbrain.lobes.models.g2p.dataio.enable_eos_bos(tokens, encoder, bos_index, eos_index)[source]
Initializes the phoneme encoder with EOS/BOS sequences
- Parameters:
tokens (list) – a list of tokens
encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance. If none is provided, a new one will be instantiated
bos_index (int) – the position corresponding to the Beginning-of-Sentence token
eos_index (int) – the position corresponding to the End-of-Sentence token
- Returns:
encoder – an encoder
- Return type:
speechbrain.dataio.encoder.TextEncoder
- speechbrain.lobes.models.g2p.dataio.phoneme_pipeline(phn, phoneme_encoder=None)[source]
Encodes a sequence of phonemes using the encoder provided
- Parameters:
phn (list) – List of phonemes
phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance (optional; if not provided, a new one will be created)
- Yields:
phn (list) – the original list of phonemes
phn_encoded_list (list) – encoded phonemes, as a list
phn_encoded (torch.Tensor) – encoded phonemes, as a tensor
- speechbrain.lobes.models.g2p.dataio.add_bos_eos(seq=None, encoder=None)[source]
Adds BOS and EOS tokens to the sequence provided
- Parameters:
seq (torch.Tensor) – the source sequence
encoder (speechbrain.dataio.encoder.TextEncoder) – an encoder instance
- Yields:
seq_eos (torch.Tensor) – the sequence, with the EOS token added
seq_bos (torch.Tensor) – the sequence, with the BOS token added
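The idea behind the two outputs can be sketched on plain lists (the real function operates on tensors and takes the BOS/EOS indexes from the encoder; the function name and default indexes below are hypothetical):

```python
def add_bos_eos_sketch(seq, bos_index=1, eos_index=2):
    """Returns (seq_bos, seq_eos): the sequence with a BOS token
    prepended, and the sequence with an EOS token appended."""
    seq = list(seq)
    return [bos_index] + seq, seq + [eos_index]

seq_bos, seq_eos = add_bos_eos_sketch([5, 6, 7])  # [1, 5, 6, 7] and [5, 6, 7, 2]
```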
- speechbrain.lobes.models.g2p.dataio.beam_search_pipeline(char_lens, encoder_out, beam_searcher)[source]
Performs a beam search on the phonemes. This function is meant to be used as a component in a decoding pipeline
- Parameters:
char_lens (torch.Tensor) – the lengths of the character inputs
encoder_out (torch.Tensor) – raw encoder outputs
beam_searcher (speechbrain.decoders.seq2seq.S2SBeamSearcher) – a SpeechBrain beam searcher instance
- Returns:
hyps (list) – hypotheses
scores (list) – the confidence score associated with each hypothesis
- speechbrain.lobes.models.g2p.dataio.phoneme_decoder_pipeline(hyps, phoneme_encoder)[source]
Decodes a sequence of phonemes
- Parameters:
hyps (list) – hypotheses (sequences of phoneme indexes)
phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance
- speechbrain.lobes.models.g2p.dataio.char_range(start_char, end_char)[source]
Produces a list of consecutive characters
- Parameters:
start_char (str) – the first character in the range
end_char (str) – the last character in the range
- speechbrain.lobes.models.g2p.dataio.build_token_char_map(tokens)[source]
Builds a map that maps arbitrary tokens to arbitrarily chosen characters. This is required to overcome the limitations of SentencePiece.
- Parameters:
tokens (list) – a list of tokens to map
- speechbrain.lobes.models.g2p.dataio.flip_map(map_dict)[source]
Exchanges keys and values in a dictionary
- Parameters:
map_dict (dict) – the dictionary to flip
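How these three helpers fit together can be sketched in plain Python. The exact character range the real build_token_char_map uses is an implementation detail; the range starting at "A" below is an assumption for illustration:

```python
def char_range(start_char, end_char):
    # Consecutive characters, inclusive of both endpoints.
    return [chr(c) for c in range(ord(start_char), ord(end_char) + 1)]

def build_token_char_map_sketch(tokens):
    # Assign each token one character from a contiguous range
    # (the starting character "A" is an arbitrary choice here).
    chars = char_range("A", chr(ord("A") + len(tokens) - 1))
    return dict(zip(tokens, chars))

def flip_map(map_dict):
    # Exchange keys and values (assumes the values are unique).
    return {value: key for key, value in map_dict.items()}

token_map = build_token_char_map_sketch(["AA", "AE", "AH"])  # {"AA": "A", "AE": "B", "AH": "C"}
char_map = flip_map(token_map)                               # {"A": "AA", "B": "AE", "C": "AH"}
```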
- speechbrain.lobes.models.g2p.dataio.text_decode(seq, encoder)[source]
Decodes a sequence using a tokenizer. This function is meant to be used in hparam files
- Parameters:
seq (torch.Tensor) – token indexes
encoder (sb.dataio.encoder.TextEncoder) – a text encoder instance
- Returns:
output_seq – a list of lists of tokens
- Return type:
list
- speechbrain.lobes.models.g2p.dataio.char_map_detokenize(char_map, tokenizer, token_space_index=None, wordwise=True)[source]
Returns a function that recovers the original sequence from one that has been tokenized using a character map
- Parameters:
char_map (dict) – a character-to-output-token map
tokenizer (speechbrain.tokenizers.SentencePiece.SentencePiece) – a tokenizer instance
token_space_index (int) – the index of the "space" token
wordwise (bool) – whether to apply detokenization one word at a time
- Returns:
f – the detokenizer function
- Return type:
callable
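The recovery direction can be sketched without a tokenizer: given the flipped character map (stand-in character back to phoneme label), each decoded character string is translated back into a space-separated phoneme sequence. The function name and map below are hypothetical:

```python
def detokenize_chars(char_strings, char_map):
    """char_map maps stand-in characters back to phoneme labels;
    char_strings is the decoded tokenizer output, one string per item."""
    return [" ".join(char_map[ch] for ch in s) for s in char_strings]

char_map = {"A": "DH", "B": "UH"}
detokenize_chars(["AB", "BA"], char_map)  # ["DH UH", "UH DH"]
```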
- class speechbrain.lobes.models.g2p.dataio.LazyInit(init)[source]
Bases: Module
A lazy initialization wrapper
- Parameters:
init (callable) – The function to initialize the underlying object
- speechbrain.lobes.models.g2p.dataio.lazy_init(init)[source]
A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)
- Parameters:
init (callable) – a constructor or function that creates an object
- Returns:
instance – the object instance
- Return type:
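The lazy-initialization pattern itself can be sketched as follows (the real LazyInit is a torch Module; the simplified class below is a hypothetical stand-in):

```python
class LazyInitSketch:
    """Defers construction until the wrapper is first called, so that
    expensive setup (e.g. tokenizer training) runs at most once."""

    def __init__(self, init):
        self.init = init
        self.instance = None

    def __call__(self):
        if self.instance is None:
            self.instance = self.init()
        return self.instance

calls = []
lazy = LazyInitSketch(lambda: calls.append("init") or object())
obj_a = lazy()
obj_b = lazy()  # no second initialization; the same instance is returned
```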
- speechbrain.lobes.models.g2p.dataio.get_sequence_key(key, mode)[source]
Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention
- Parameters:
key (str) – the base sequence key
mode (str) – the mode determining which variant of the sequence is used
- speechbrain.lobes.models.g2p.dataio.phonemes_to_label(phns, decoder)[source]
Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens
- Parameters:
phns (torch.Tensor) – a batch of phoneme sequences
decoder (Callable) – Converts tensor to phoneme label strings.
- Returns:
result – a list of strings corresponding to the phonemes provided
- Return type:
list
- speechbrain.lobes.models.g2p.dataio.remove_special(phn)[source]
Removes any special tokens from the sequence. Special tokens are delimited by angle brackets.
- Parameters:
phn (list) – a list of phoneme tokens
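The filtering of angle-bracket tokens, followed by joining into a label string as phonemes_to_label describes, can be sketched as (function names are hypothetical):

```python
def remove_special_sketch(phn):
    # Drop tokens delimited by angle brackets, e.g. "<bos>", "<eos>".
    return [p for p in phn if not (p.startswith("<") and p.endswith(">"))]

def to_label(phn):
    # Space-separated label string, in the style of phonemes_to_label.
    return " ".join(remove_special_sketch(phn))

to_label(["<bos>", "B", "UH", "K", "<eos>"])  # "B UH K"
```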
- speechbrain.lobes.models.g2p.dataio.word_emb_pipeline(txt, grapheme_encoded, grapheme_encoded_len, grapheme_encoder=None, word_emb=None, use_word_emb=None)[source]
Applies word embeddings, if applicable. This function is meant to be used as part of the encoding pipeline
- Parameters:
txt (str) – the raw text
grapheme_encoded (torch.Tensor) – the encoded graphemes
grapheme_encoded_len (torch.Tensor) – encoded grapheme lengths
grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – the text encoder used for graphemes
word_emb (callable) – the model that produces word embeddings
use_word_emb (bool) – a flag indicating whether word embeddings are to be applied
- Returns:
char_word_emb – word embeddings, expanded to the character dimension
- Return type:
torch.Tensor