speechbrain.lobes.models.g2p.dataio module

Data pipeline elements for the G2P pipeline

Authors

Loren Lugosch 2020
Mirco Ravanelli 2020
Artem Ploujnikov 2021 (minor refactoring only)

Summary

Classes:

`LazyInit`	A lazy initialization wrapper

Functions:

`add_bos_eos`	Adds BOS and EOS tokens to the sequence provided
`beam_search_pipeline`	Performs a Beam Search on the phonemes.
`build_token_char_map`	Builds a map that maps arbitrary tokens to arbitrarily chosen characters.
`char_map_detokenize`	Returns a function that recovers the original sequence from one that has been tokenized using a character map
`char_range`	Produces a list of consecutive characters
`clean_pipeline`	Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase
`enable_eos_bos`	Initializes the phoneme encoder with EOS/BOS sequences
`flip_map`	Exchanges keys and values in a dictionary
`get_sequence_key`	Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention.
`grapheme_pipeline`	Encodes a grapheme sequence
`lazy_init`	A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)
`phoneme_decoder_pipeline`	Decodes a sequence of phonemes
`phoneme_pipeline`	Encodes a sequence of phonemes using the encoder provided
`phonemes_to_label`	Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens.
`remove_special`	Removes any special tokens from the sequence.
`text_decode`	Decodes a sequence using a tokenizer.
`tokenizer_encode_pipeline`	A pipeline element that uses a pretrained tokenizer
`word_emb_pipeline`	Applies word embeddings, if applicable.

Reference

speechbrain.lobes.models.g2p.dataio.clean_pipeline(txt, graphemes)

Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase

Parameters:

txt (str) – the text to clean up
graphemes (list) – a list of graphemes

Returns:

item – A wrapped transformation function

Return type:

DynamicItem
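
The cleaning step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `clean_text_sketch` and the sample grapheme inventory are hypothetical.

```python
def clean_text_sketch(txt, graphemes):
    """Uppercase the text and drop characters outside the accepted graphemes.

    Hypothetical stand-in for illustration; not SpeechBrain's clean_pipeline.
    """
    allowed = set(graphemes)
    return "".join(ch for ch in txt.upper() if ch in allowed)


# Assumed grapheme inventory: uppercase letters plus the space character.
graphemes = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ ")
cleaned = clean_text_sketch("Hello, world!", graphemes)
print(cleaned)
```

Punctuation is dropped because it is not in the accepted list, while the space survives because the assumed inventory includes it.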

speechbrain.lobes.models.g2p.dataio.grapheme_pipeline(char, grapheme_encoder=None, uppercase=True)

Encodes a grapheme sequence

Parameters:

char (str) – A list of characters to encode.
grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder for graphemes. If not provided,
uppercase (bool) – whether or not to convert items to uppercase

Yields:

grapheme_list (list) – a raw list of graphemes, excluding any non-matching labels
grapheme_encoded_list (list) – a list of graphemes encoded as integers
grapheme_encoded (torch.Tensor) – a list of graphemes encoded as a tensor
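
The multi-output behavior listed above can be illustrated with a generator. In this sketch a plain dictionary (`char_to_index`, an assumption) stands in for the text encoder, and the final tensor output is omitted so that the example needs no external packages; the function name is hypothetical.

```python
def grapheme_pipeline_sketch(char, char_to_index, uppercase=True):
    """Yield the filtered graphemes, then their integer codes.

    Hypothetical illustration: a plain dict stands in for the text encoder.
    """
    if uppercase:
        char = char.upper()
    # Keep only the characters the (assumed) encoder knows about.
    grapheme_list = [ch for ch in char if ch in char_to_index]
    yield grapheme_list
    yield [char_to_index[ch] for ch in grapheme_list]


# Assumed label set: A-Z plus space, indexed in order.
char_to_index = {ch: idx for idx, ch in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ ")}
grapheme_list, grapheme_encoded_list = grapheme_pipeline_sketch("cab!", char_to_index)
```

Each `yield` corresponds to one of the named outputs, in the order in which they are listed.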

speechbrain.lobes.models.g2p.dataio.tokenizer_encode_pipeline(seq, tokenizer, tokens, wordwise=True, word_separator=' ', token_space_index=512, char_map=None)

A pipeline element that uses a pretrained tokenizer

Parameters:

seq (list) – List of tokens to encode.
tokenizer (speechbrain.tokenizers.SentencePiece.SentencePiece) – a tokenizer instance
tokens (str) – available tokens
wordwise (bool) – whether tokenization is performed one word at a time (True) or on the whole sequence (False). Whole-sequence tokenization can produce token sequences in which a token may span multiple words
word_separator (str) – The substring to use as a separator between words.
token_space_index (int) – the index of the space token
char_map (dict) – a mapping from characters to tokens. This is used when tokenizing sequences of phonemes rather than sequences of characters. A sequence of phonemes is typically a list of one- or two-character tokens (e.g. [“DH”, “UH”, “ ”, “S”, “AW”, “N”, “D”]). The character map makes it possible to map these to arbitrarily selected characters

Yields:

token_list (list) – a list of raw tokens
encoded_list (list) – a list of tokens, encoded as a list of integers
encoded (torch.Tensor) – a list of tokens, encoded as a tensor
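
To make the role of the character map concrete, the sketch below converts a phoneme sequence into one string of stand-in characters per word, which is the form a character-level tokenizer could then consume. The helper name, the direction of the toy map (phoneme label to character) and the map's contents are assumptions made for illustration; the tokenizer call itself is left out.

```python
def phonemes_to_chars_sketch(seq, char_map, word_separator=" "):
    """Map each phoneme label to its single-character stand-in, word by word.

    Hypothetical helper; it only illustrates the mapping step.
    """
    words, current = [], []
    for token in seq:
        if token == word_separator:
            # A separator closes the current word.
            words.append("".join(current))
            current = []
        else:
            current.append(char_map[token])
    words.append("".join(current))
    return words


# Assumed toy map from phoneme labels to arbitrarily chosen characters.
char_map = {"DH": "A", "UH": "B", "S": "C", "AW": "D", "N": "E", "D": "F"}
seq = ["DH", "UH", " ", "S", "AW", "N", "D"]
words = phonemes_to_chars_sketch(seq, char_map)
```

Splitting on the separator before tokenizing is what keeps any single token from spanning two words.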

speechbrain.lobes.models.g2p.dataio.enable_eos_bos(tokens, encoder, bos_index, eos_index)

Initializes the phoneme encoder with EOS/BOS sequences

Parameters:

tokens (list) – a list of tokens
encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance. If none is provided, a new one will be instantiated
bos_index (int) – the position corresponding to the Beginning-of-Sentence token
eos_index (int) – the position corresponding to the End-of-Sentence token

Returns:

encoder – an encoder

Return type:

speechbrain.dataio.encoder.TextEncoder

speechbrain.lobes.models.g2p.dataio.phoneme_pipeline(phn, phoneme_encoder=None)

Encodes a sequence of phonemes using the encoder provided

Parameters:

phn (list) – List of phonemes
phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance (optional; if not provided, a new one will be created)

Yields:

phn (list) – the original list of phonemes
phn_encoded_list (list) – encoded phonemes, as a list
phn_encoded (torch.Tensor) – encoded phonemes, as a tensor

speechbrain.lobes.models.g2p.dataio.add_bos_eos(seq=None, encoder=None)

Adds BOS and EOS tokens to the sequence provided

Parameters:

seq (torch.Tensor) – the source sequence
encoder (speechbrain.dataio.encoder.TextEncoder) – an encoder instance

Yields:

seq_eos (torch.Tensor) – the sequence, with the EOS token added
seq_bos (torch.Tensor) – the sequence, with the BOS token added
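
The two outputs differ only in where the special token goes. The sketch below shows this with plain lists instead of tensors; the function name and the index values 0 and 1 are hypothetical, and the sketch takes the indices directly rather than reading them from an encoder.

```python
def add_bos_eos_sketch(seq, bos_index, eos_index):
    """Return (seq_eos, seq_bos) for a list of integer token indices.

    Hypothetical illustration using lists rather than torch tensors.
    """
    seq_eos = list(seq) + [eos_index]  # EOS appended at the end
    seq_bos = [bos_index] + list(seq)  # BOS prepended at the start
    return seq_eos, seq_bos


# Assumed special-token indices: BOS = 0, EOS = 1.
seq_eos, seq_bos = add_bos_eos_sketch([5, 6, 7], bos_index=0, eos_index=1)
```

In sequence-to-sequence training, the BOS-prefixed sequence typically serves as the decoder input and the EOS-suffixed sequence as the prediction target.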

speechbrain.lobes.models.g2p.dataio.beam_search_pipeline(char_lens, encoder_out, beam_searcher)

Performs a Beam Search on the phonemes. This function is meant to be used as a component in a decoding pipeline

Parameters:

char_lens (torch.Tensor) – the length of character inputs
encoder_out (torch.Tensor) – Raw encoder outputs
beam_searcher (speechbrain.decoders.seq2seq.S2SBeamSearcher) – a SpeechBrain beam searcher instance

Returns:

hyps (list) – hypotheses
scores (list) – confidence scores associated with each hypothesis

speechbrain.lobes.models.g2p.dataio.phoneme_decoder_pipeline(hyps, phoneme_encoder)

Decodes a sequence of phonemes

Parameters:

hyps (list) – hypotheses, the output of a beam search
phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance

Returns:

phonemes – the phoneme sequence

Return type:

list

speechbrain.lobes.models.g2p.dataio.char_range(start_char, end_char)

Produces a list of consecutive characters

Parameters:

start_char (str) – the starting character
end_char (str) – the ending character

Returns:

char_range – the character range

Return type:

list
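
A range of consecutive characters can be produced from code points. The sketch below follows the summary line, which describes the result as a list, and assumes that the ending character is included; the function name is hypothetical.

```python
def char_range_sketch(start_char, end_char):
    """List the characters from start_char to end_char.

    Assumption: the range is inclusive of end_char.
    """
    return [chr(code) for code in range(ord(start_char), ord(end_char) + 1)]


letters = char_range_sketch("a", "e")
```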

speechbrain.lobes.models.g2p.dataio.build_token_char_map(tokens)

Builds a map that maps arbitrary tokens to arbitrarily chosen characters. This is required to overcome the limitations of SentencePiece.

Parameters:

tokens (list) – a list of tokens for which to produce the map

Returns:

token_map – a dictionary with original tokens as keys and new mappings as values

Return type:

dict
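
One way to build such a map is to pair each token with the next unused letter. The sketch below is an illustration only: the choice of A-Z followed by a-z as the character pool, the treatment of the space as mapping to itself, and the function name are all assumptions.

```python
def build_token_char_map_sketch(tokens):
    """Assign one stand-in character to every distinct non-space token.

    Hypothetical illustration; the character pool is an assumption.
    """
    pool = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
    pool += [chr(c) for c in range(ord("a"), ord("z") + 1)]
    # Preserve first-seen order and skip the space, which maps to itself.
    unique_tokens = [t for t in dict.fromkeys(tokens) if t != " "]
    if len(unique_tokens) > len(pool):
        raise ValueError("more tokens than available stand-in characters")
    token_map = dict(zip(unique_tokens, pool))
    token_map[" "] = " "
    return token_map


token_map = build_token_char_map_sketch(["DH", "UH", " ", "S"])
```

Because every multi-character phoneme label becomes exactly one character, a tokenizer that operates on characters sees each phoneme as an indivisible symbol.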

speechbrain.lobes.models.g2p.dataio.flip_map(map_dict)

Exchanges keys and values in a dictionary

Parameters:

map_dict (dict) – a dictionary

Returns:

reverse_map_dict – a dictionary with keys and values flipped

Return type:

dict
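
Exchanging keys and values is a one-line dictionary comprehension. The function name below is hypothetical, and the sketch assumes the values are unique and hashable; with duplicate values, later entries would overwrite earlier ones.

```python
def flip_map_sketch(map_dict):
    """Return a new dict whose keys are the old values and vice versa."""
    return {value: key for key, value in map_dict.items()}


flipped = flip_map_sketch({"DH": "A", "UH": "B"})
```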

speechbrain.lobes.models.g2p.dataio.text_decode(seq, encoder)

Decodes a sequence using a tokenizer. This function is meant to be used in hparam files

Parameters:

seq (torch.Tensor) – token indexes
encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance

Returns:

output_seq – a list of lists of tokens

Return type:

list

speechbrain.lobes.models.g2p.dataio.char_map_detokenize(char_map, tokenizer, token_space_index=None, wordwise=True)

Returns a function that recovers the original sequence from one that has been tokenized using a character map

Parameters:

char_map (dict) – a character-to-output-token map
tokenizer (speechbrain.tokenizers.SentencePiece.SentencePiece) – a tokenizer instance
token_space_index (int) – the index of the “space” token
wordwise (bool) – Whether to apply detokenize per word.

Returns:

f – the detokenizer function

Return type:

callable
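
The "function that returns a function" pattern can be sketched as a closure. Here a hypothetical `decode` callable stands in for the tokenizer's decoding step, and a toy vocabulary replaces a trained model; none of these names come from the library, and word-wise handling is left out.

```python
def char_map_detokenize_sketch(char_map, decode):
    """Build a function that decodes token ids and maps characters to labels.

    Hypothetical illustration: `decode` must turn a list of ids into a
    string of stand-in characters.
    """
    def f(batch):
        return [[char_map[ch] for ch in decode(item)] for item in batch]

    return f


# Assumed toy vocabulary: each id decodes to a piece of stand-in characters.
id_to_piece = {0: "AB", 1: " ", 2: "C"}


def decode(ids):
    return "".join(id_to_piece[i] for i in ids)


# Character-to-output-token map, as described for the char_map parameter.
char_map = {"A": "DH", "B": "UH", "C": "S", " ": " "}
detokenize = char_map_detokenize_sketch(char_map, decode)
result = detokenize([[0, 1, 2]])
```

The returned callable captures the map and the decoder, so it can be passed around as a single-argument function.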

class speechbrain.lobes.models.g2p.dataio.LazyInit(init)

Bases: Module

A lazy initialization wrapper

Parameters:

init (callable) – The function to initialize the underlying object

__call__()

Initializes the object instance, if necessary, and returns it.

to(device)

Moves the underlying object to the specified device

Parameters:

device (str | torch.device) – the device

Return type:

self

speechbrain.lobes.models.g2p.dataio.lazy_init(init)

A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)

Parameters:

init (callable) – a constructor or function that creates an object

Returns:

instance – the object instance

Return type:

object
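
The initialize-once behavior can be sketched with a small wrapper class. This is a plain-Python illustration and not the library's class, which is listed as deriving from Module; `LazyInitSketch`, `build_tokenizer` and the `calls` list are hypothetical names used only to make the effect observable.

```python
class LazyInitSketch:
    """Defer object creation until the first call, then reuse the result."""

    def __init__(self, init):
        self.init = init
        self.instance = None

    def __call__(self):
        # Create the object on first use only.
        if self.instance is None:
            self.instance = self.init()
        return self.instance


calls = []


def build_tokenizer():
    # Stand-in for an expensive constructor, such as one that trains a model.
    calls.append("built")
    return {"name": "toy-tokenizer"}


lazy = LazyInitSketch(build_tokenizer)
first = lazy()
second = lazy()
```

Both calls return the same object, and the constructor runs a single time.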

speechbrain.lobes.models.g2p.dataio.get_sequence_key(key, mode)

Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention

Parameters:

key (str) – the key (e.g. “graphemes”, “phonemes”)
mode (str) – the mode/suffix (raw, eos/bos)

Returns:

key – the key itself if mode == "raw", otherwise f"{key}_{mode}"

Return type:

str
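
The naming convention is simple enough to show directly. The sketch below restates the expression given above under a hypothetical function name; the sample keys and modes are illustrative.

```python
def get_sequence_key_sketch(key, mode):
    """Return the bare key for raw mode, or the key with a mode suffix."""
    return key if mode == "raw" else f"{key}_{mode}"


raw_key = get_sequence_key_sketch("phonemes", "raw")
bos_key = get_sequence_key_sketch("phonemes", "bos")
```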

speechbrain.lobes.models.g2p.dataio.phonemes_to_label(phns, decoder)

Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. [“T AY B L”, “B UH K”]), removing any special tokens

Parameters:

phns (torch.Tensor) – a batch of phoneme sequences
decoder (Callable) – Converts tensor to phoneme label strings.

Returns:

result – a list of strings corresponding to the phonemes provided

Return type:

list
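
The conversion amounts to decoding, filtering and joining. The sketch below uses nested lists of integers in place of a tensor and a toy decoder built from a hypothetical label table; the function names, the table and the rule for detecting special tokens are assumptions.

```python
def phonemes_to_label_sketch(phns, decoder):
    """Decode a batch of index sequences into space-separated label strings.

    Hypothetical illustration; tokens wrapped in angle brackets are treated
    as special and dropped.
    """
    decoded = decoder(phns)
    return [
        " ".join(
            token
            for token in seq
            if not (token.startswith("<") and token.endswith(">"))
        )
        for seq in decoded
    ]


# Assumed toy label table, including two special tokens.
labels = {0: "<bos>", 1: "<eos>", 2: "T", 3: "AY", 4: "B", 5: "L", 6: "UH", 7: "K"}


def toy_decoder(batch):
    return [[labels[idx] for idx in seq] for seq in batch]


result = phonemes_to_label_sketch([[0, 2, 3, 4, 5, 1], [0, 4, 6, 7, 1]], toy_decoder)
```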

speechbrain.lobes.models.g2p.dataio.remove_special(phn)

Removes any special tokens from the sequence. Special tokens are delimited by angle brackets.

Parameters:

phn (list) – a list of phoneme labels

Returns:

result – the original list, without any special tokens

Return type:

list
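
A filter of this kind might look like the following. The sketch assumes that a special token both starts with "<" and ends with ">"; the exact test used by the library is not stated here, and the function name is hypothetical.

```python
def remove_special_sketch(phn):
    """Drop labels that are wrapped in angle brackets."""
    return [
        token
        for token in phn
        if not (token.startswith("<") and token.endswith(">"))
    ]


kept = remove_special_sketch(["<bos>", "B", "UH", "K", "<eos>"])
```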

speechbrain.lobes.models.g2p.dataio.word_emb_pipeline(txt, grapheme_encoded, grapheme_encoded_len, grapheme_encoder=None, word_emb=None, use_word_emb=None)

Applies word embeddings, if applicable. This function is meant to be used as part of the encoding pipeline

Parameters:

txt (str) – the raw text
grapheme_encoded (torch.Tensor) – the encoded graphemes
grapheme_encoded_len (torch.Tensor) – encoded grapheme lengths
grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – the text encoder used for graphemes
word_emb (callable) – the model that produces word embeddings
use_word_emb (bool) – a flag indicating whether word embeddings are to be applied

Returns:

char_word_emb – Word embeddings, expanded to the character dimension

Return type:

torch.Tensor
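
"Expanded to the character dimension" means that every character position receives the embedding of the word it belongs to. The sketch below shows the idea with nested lists instead of tensors. It assumes that words are separated by single spaces and that space positions receive a zero vector; both choices, and the function name, are illustrative assumptions rather than the library's documented behavior.

```python
def expand_to_chars_sketch(txt, word_embs, emb_dim):
    """Repeat each word's embedding once per character of that word.

    Hypothetical illustration; spaces are assumed to get a zero vector.
    """
    zero = [0.0] * emb_dim
    expanded = []
    word_idx = 0
    for ch in txt:
        if ch == " ":
            expanded.append(zero)
            word_idx += 1  # the next character starts a new word
        else:
            expanded.append(word_embs[word_idx])
    return expanded


# Two words, "AB" and "C", with assumed 2-dimensional embeddings.
char_word_emb = expand_to_chars_sketch("AB C", [[1.0, 2.0], [3.0, 4.0]], emb_dim=2)
```

The result has one row per character, so it can be aligned with the encoded grapheme sequence position by position.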