speechbrain.lobes.models.g2p.dataio module
Data pipeline elements for the G2P pipeline
- Authors
Loren Lugosch 2020
Mirco Ravanelli 2020
Artem Ploujnikov 2021 (minor refactoring only)
Summary
Classes:

LazyInit | A lazy initialization wrapper

Functions:

add_bos_eos | Adds BOS and EOS tokens to the sequence provided
beam_search_pipeline | Performs a Beam Search on the phonemes.
build_token_char_map | Builds a map that maps arbitrary tokens to arbitrarily chosen characters.
char_map_detokenize | Returns a function that recovers the original sequence from one that has been tokenized using a character map
char_range | Produces a list of consecutive characters
clean_pipeline | Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase
enable_eos_bos | Initializes the phoneme encoder with EOS/BOS sequences
flip_map | Exchanges keys and values in a dictionary
get_sequence_key | Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention.
grapheme_pipeline | Encodes a grapheme sequence
lazy_init | A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)
phoneme_decoder_pipeline | Decodes a sequence of phonemes
phoneme_pipeline | Encodes a sequence of phonemes using the encoder provided
phonemes_to_label | Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens.
remove_special | Removes any special tokens from the sequence.
text_decode | Decodes a sequence using a tokenizer.
tokenizer_encode_pipeline | A pipeline element that uses a pretrained tokenizer
word_emb_pipeline | Applies word embeddings, if applicable.
Reference
- speechbrain.lobes.models.g2p.dataio.clean_pipeline(txt, graphemes)[source]
Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase
- Parameters:
txt (str) – the incoming text
graphemes (list) – the list of accepted graphemes
- Returns:
item – A wrapped transformation function
- Return type:
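The cleaning step can be sketched in plain Python. This is a minimal illustration of the described behavior, not the SpeechBrain implementation; `clean_text` is a hypothetical name.

```python
# Minimal sketch of the cleaning step: uppercase the text, then drop any
# character that is not in the accepted grapheme set.
def clean_text(txt, graphemes):
    """Uppercase `txt` and keep only characters found in `graphemes`."""
    accepted = set(graphemes)
    return "".join(ch for ch in txt.upper() if ch in accepted)
```

For example, with the English alphabet plus space as the grapheme list, punctuation is silently dropped and the case is normalized.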
- speechbrain.lobes.models.g2p.dataio.grapheme_pipeline(char, grapheme_encoder=None, uppercase=True)[source]
Encodes a grapheme sequence
- Parameters:
char (str) – the character sequence to encode
grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder for graphemes. If not provided, a new one will be created
uppercase (bool) – whether or not to convert items to uppercase
- Returns:
grapheme_list (list) – a raw list of graphemes, excluding any non-matching labels
grapheme_encoded_list (list) – a list of graphemes encoded as integers
grapheme_encoded (torch.Tensor) – encoded graphemes, as a tensor
- speechbrain.lobes.models.g2p.dataio.tokenizer_encode_pipeline(seq, tokenizer, tokens, wordwise=True, word_separator=' ', token_space_index=512, char_map=None)[source]
A pipeline element that uses a pretrained tokenizer
- Parameters:
seq (str) – the raw text to tokenize
tokenizer (speechbrain.tokenizers.SentencePiece) – a tokenizer instance
tokens (str) – available tokens
wordwise (bool) – whether tokenization is performed one word at a time rather than on the whole sequence. Tokenizing the whole sequence can produce token sequences in which a single token spans multiple words
word_separator (str) – the character used to separate words
token_space_index (int) – the index of the space token
char_map (dict) – a mapping from characters to tokens. This is used when tokenizing sequences of phonemes rather than sequences of characters. A sequence of phonemes is typically a list of one- or two-character tokens (e.g. ["DH", "UH", " ", "S", "AW", "N", "D"]). The character map makes it possible to map these to arbitrarily selected characters
- Returns:
token_list (list) – a list of raw tokens
encoded_list (list) – a list of tokens, encoded as a list of integers
encoded (torch.Tensor) – a list of tokens, encoded as a tensor
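The wordwise option can be illustrated with a stand-in tokenizer: each word is tokenized separately and the pieces are joined with a dedicated space token, so no token can ever span two words. Everything here (the toy tokenizer, the space index) is an assumption for illustration, not the SentencePiece behavior.

```python
# Illustrative sketch of wordwise tokenization. `toy_tokenize` stands in
# for a real subword tokenizer; SPACE_TOKEN mirrors token_space_index.
SPACE_TOKEN = 512  # hypothetical index of the space token

def toy_tokenize(word):
    """Stand-in for a subword tokenizer: one token id per character."""
    return [ord(ch) for ch in word]

def wordwise_encode(seq, word_separator=" "):
    """Tokenize each word independently, joining words with SPACE_TOKEN."""
    encoded = []
    for i, word in enumerate(seq.split(word_separator)):
        if i > 0:
            encoded.append(SPACE_TOKEN)
        encoded.extend(toy_tokenize(word))
    return encoded
```

Because each word is encoded in isolation, the resulting token boundaries always align with word boundaries, which is the property wordwise mode provides.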
- speechbrain.lobes.models.g2p.dataio.enable_eos_bos(tokens, encoder, bos_index, eos_index)[source]
Initializes the phoneme encoder with EOS/BOS sequences
- Parameters:
tokens (list) – a list of tokens
encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance. If none is provided, a new one will be instantiated
bos_index (int) – the position corresponding to the Beginning-of-Sentence token
eos_index (int) – the position corresponding to the End-of-Sentence token
- Returns:
encoder – the initialized encoder
- Return type:
- speechbrain.lobes.models.g2p.dataio.phoneme_pipeline(phn, phoneme_encoder=None)[source]
Encodes a sequence of phonemes using the encoder provided
- Parameters:
phn (list) – a list of phoneme labels
phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance (optional, if not provided, a new one will be created)
- Returns:
phn (list) – the original list of phonemes
phn_encoded_list (list) – encoded phonemes, as a list
phn_encoded (torch.Tensor) – encoded phonemes, as a tensor
- speechbrain.lobes.models.g2p.dataio.add_bos_eos(seq=None, encoder=None)[source]
Adds BOS and EOS tokens to the sequence provided
- Parameters:
seq (torch.Tensor) – the source sequence
encoder (speechbrain.dataio.encoder.TextEncoder) – an encoder instance
- Returns:
seq_eos (torch.Tensor) – the sequence, with the EOS token added
seq_bos (torch.Tensor) – the sequence, with the BOS token added
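The BOS/EOS augmentation can be sketched on plain lists (the actual function works on tensors via a TextEncoder; the index values 0 and 1 are assumptions for illustration):

```python
# Minimal sketch of BOS/EOS augmentation on plain lists. BOS_INDEX and
# EOS_INDEX are assumed values; the real indices come from the encoder.
BOS_INDEX, EOS_INDEX = 0, 1

def add_bos_eos(seq):
    """Return (seq_eos, seq_bos): the sequence with an EOS token appended
    and with a BOS token prepended, respectively."""
    seq_eos = list(seq) + [EOS_INDEX]
    seq_bos = [BOS_INDEX] + list(seq)
    return seq_eos, seq_bos
```

Both variants are returned because seq2seq training typically feeds the BOS-prefixed sequence to the decoder and compares its output against the EOS-suffixed target.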
- speechbrain.lobes.models.g2p.dataio.beam_search_pipeline(char_lens, encoder_out, beam_searcher)[source]
Performs a Beam Search on the phonemes. This function is meant to be used as a component in a decoding pipeline
- Parameters:
char_lens (torch.Tensor) – the length of character inputs
encoder_out (torch.Tensor) – Raw encoder outputs
beam_searcher (speechbrain.decoders.seq2seq.S2SBeamSearcher) – a SpeechBrain beam searcher instance
- Returns:
hyps (list) – hypotheses
scores (list) – confidence scores associated with each hypothesis
- speechbrain.lobes.models.g2p.dataio.phoneme_decoder_pipeline(hyps, phoneme_encoder)[source]
Decodes a sequence of phonemes
- speechbrain.lobes.models.g2p.dataio.char_range(start_char, end_char)[source]
Produces a list of consecutive characters
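The likely behavior, sketched in plain Python (an assumption based on the description, not the exact SpeechBrain code):

```python
# Sketch of char_range: the inclusive list of characters between two
# code points, e.g. char_range("a", "e") -> ["a", "b", "c", "d", "e"].
def char_range(start_char, end_char):
    return [chr(c) for c in range(ord(start_char), ord(end_char) + 1)]
```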
- speechbrain.lobes.models.g2p.dataio.build_token_char_map(tokens)[source]
Builds a map that maps arbitrary tokens to arbitrarily chosen characters. This is required to overcome the limitations of SentencePiece.
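SentencePiece operates on character strings, so multi-character tokens (e.g. a phoneme label like "AY") must first be remapped to single placeholder characters. A minimal sketch of such a map follows; the starting code point ('a') is an assumption for illustration.

```python
# Sketch of a token-to-character map: each token is assigned a unique
# single character so a character-level tokenizer can process the sequence.
def build_token_char_map(tokens):
    return {token: chr(ord("a") + idx) for idx, token in enumerate(tokens)}
```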
- speechbrain.lobes.models.g2p.dataio.flip_map(map_dict)[source]
Exchanges keys and values in a dictionary
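Inverting the token-to-character map is what allows decoding back from placeholder characters to tokens. The sketch below assumes the values are unique, as they are when produced by a token-to-character map.

```python
# Sketch of flip_map: swap keys and values so a token->char map can be
# used in reverse (char->token) during detokenization.
def flip_map(map_dict):
    return {value: key for key, value in map_dict.items()}
```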
- speechbrain.lobes.models.g2p.dataio.text_decode(seq, encoder)[source]
Decodes a sequence using a tokenizer. This function is meant to be used in hparam files
- Parameters:
seq (torch.Tensor) – token indexes
encoder (sb.dataio.encoder.TextEncoder) – a text encoder instance
- Returns:
output_seq – a list of lists of tokens
- Return type:
- speechbrain.lobes.models.g2p.dataio.char_map_detokenize(char_map, tokenizer, token_space_index=None, wordwise=True)[source]
Returns a function that recovers the original sequence from one that has been tokenized using a character map
- Parameters:
char_map (dict) – a character-to-output-token map
tokenizer (speechbrain.tokenizers.SentencePiece.SentencePiece) – a tokenizer instance
token_space_index (int) – the index of the "space" token
wordwise (bool) – whether tokenization was performed one word at a time
- Returns:
f – the tokenizer function
- Return type:
callable
- class speechbrain.lobes.models.g2p.dataio.LazyInit(init)[source]
Bases:
Module
A lazy initialization wrapper
- Parameters:
init (callable) – The function to initialize the underlying object
- to(device)[source]
Moves the underlying object to the specified device
- Parameters:
device (str | torch.device) – the device
- speechbrain.lobes.models.g2p.dataio.lazy_init(init)[source]
A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)
- Parameters:
init (callable) – a constructor or function that creates an object
- Returns:
instance – the object instance
- Return type:
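The idea behind lazy initialization can be sketched with a closure: the wrapped constructor runs only on the first call, and every later call returns the same instance. The real LazyInit is a torch Module with a `to(device)` method; this sketch strips that down to the core mechanism.

```python
# Minimal sketch of lazy initialization: defer an expensive constructor
# (e.g. a tokenizer that trains on instantiation) until first use.
def lazy_init(init):
    instance = None

    def get_instance():
        nonlocal instance
        if instance is None:
            instance = init()  # runs only once, on the first call
        return instance

    return get_instance
```

This pattern avoids paying the construction cost at module import time and guarantees a single shared instance.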
- speechbrain.lobes.models.g2p.dataio.get_sequence_key(key, mode)[source]
Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention
- Parameters:
key (str) – the key (e.g. “graphemes”, “phonemes”)
mode (str) – the mode/suffix (raw, eos/bos)
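A plausible sketch of the naming convention, assuming "raw" maps to the bare key and other modes append a suffix (the exact convention is an assumption for illustration):

```python
# Sketch of the sequence-key convention: "raw" mode keeps the bare key,
# other modes produce e.g. "phn" + "eos" -> "phn_eos".
def get_sequence_key(key, mode):
    return key if mode == "raw" else f"{key}_{mode}"
```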
- speechbrain.lobes.models.g2p.dataio.phonemes_to_label(phns, decoder)[source]
Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens
- Parameters:
phns (sequence) – a batch of phoneme sequences
decoder (callable) – a function that decodes a phoneme sequence into a list of labels
- Returns:
result – a list of strings corresponding to the phonemes provided
- Return type:
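The conversion can be sketched as: decode each sequence to phoneme labels, drop special tokens (delimited by angle brackets), and join with spaces. `decode_seq` below stands in for the decoder argument and is a hypothetical name.

```python
# Sketch of batch-to-label conversion: decode, strip special tokens
# (anything in angle brackets), and join labels with spaces.
def phonemes_to_label(batch, decode_seq):
    result = []
    for seq in batch:
        labels = [p for p in decode_seq(seq) if not p.startswith("<")]
        result.append(" ".join(labels))
    return result
```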
- speechbrain.lobes.models.g2p.dataio.remove_special(phn)[source]
Removes any special tokens from the sequence. Special tokens are delimited by angle brackets.
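Since special tokens are delimited by angle brackets, the filter can be sketched as a simple comprehension (an illustration of the described behavior, not the exact implementation):

```python
# Sketch of remove_special: drop tokens delimited by angle brackets,
# e.g. "<bos>" or "<eos>", keeping ordinary phoneme labels.
def remove_special(phn):
    return [token for token in phn if "<" not in token]
```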
- speechbrain.lobes.models.g2p.dataio.word_emb_pipeline(txt, grapheme_encoded, grapheme_encoded_len, grapheme_encoder=None, word_emb=None, use_word_emb=None)[source]
Applies word embeddings, if applicable. This function is meant to be used as part of the encoding pipeline
- Parameters:
txt (str) – the raw text
grapheme_encoded (torch.Tensor) – the encoded graphemes
grapheme_encoded_len (torch.Tensor) – encoded grapheme lengths
grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – the text encoder used for graphemes
word_emb (callable) – the model that produces word embeddings
use_word_emb (bool) – a flag indicating whether word embeddings are to be applied
- Returns:
char_word_emb – Word embeddings, expanded to the character dimension
- Return type:
torch.Tensor