speechbrain.lobes.models.g2p.dataio module

Data pipeline elements for the G2P pipeline

Authors
  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Artem Ploujnikov 2021 (minor refactoring only)

Summary

Classes:

LazyInit

A lazy initialization wrapper

Functions:

add_bos_eos

Adds BOS and EOS tokens to the sequence provided

beam_search_pipeline

Performs a Beam Search on the phonemes.

build_token_char_map

Builds a map that maps arbitrary tokens to arbitrarily chosen characters.

char_map_detokenize

Returns a function that recovers the original sequence from one that has been tokenized using a character map

char_range

Produces a list of consecutive characters

clean_pipeline

Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase

enable_eos_bos

Initializes the phoneme encoder with EOS/BOS sequences

flip_map

Exchanges keys and values in a dictionary

get_sequence_key

Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention.

grapheme_pipeline

Encodes a grapheme sequence

lazy_init

A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)

phoneme_decoder_pipeline

Decodes a sequence of phonemes

phoneme_pipeline

Encodes a sequence of phonemes using the encoder provided

phonemes_to_label

Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens.

remove_special

Removes any special tokens from the sequence.

text_decode

Decodes a sequence using a tokenizer.

tokenizer_encode_pipeline

A pipeline element that uses a pretrained tokenizer

word_emb_pipeline

Applies word embeddings, if applicable.

Reference

speechbrain.lobes.models.g2p.dataio.clean_pipeline(txt, graphemes)[source]

Cleans incoming text, removing any characters not on the accepted list of graphemes and converting to uppercase

Parameters:
  • txt (str) – the text to clean up

  • graphemes (list) – a list of graphemes

Returns:

item – A wrapped transformation function

Return type:

DynamicItem
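
A minimal sketch of the cleaning step itself (uppercase, then filter; hypothetical and independent of the pipeline wrapping):

    def clean_text(txt, graphemes):
        # Uppercase first, then keep only characters in the grapheme list.
        result = txt.upper()
        return "".join(char for char in result if char in graphemes)

    clean_text("Hello, world!", list("ABCDEFGHIJKLMNOPQRSTUVWXYZ '"))
    # -> 'HELLO WORLD'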

speechbrain.lobes.models.g2p.dataio.grapheme_pipeline(char, grapheme_encoder=None, uppercase=True)[source]

Encodes a grapheme sequence

Parameters:
  • char (str) – the text to encode, as a string of characters.

  • grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder for graphemes. If not provided, a new one will be created

  • uppercase (bool) – whether or not to convert items to uppercase

Yields:
  • grapheme_list (list) – a raw list of graphemes, excluding any non-matching labels

  • grapheme_encoded_list (list) – a list of graphemes encoded as integers

  • grapheme_encoded (torch.Tensor) – the encoded graphemes, as a tensor
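
Since this is a generator-style pipeline element, its yields can be consumed in order. A hypothetical sketch (the encoder setup is illustrative):

    from speechbrain.dataio.encoder import TextEncoder

    encoder = TextEncoder()
    encoder.update_from_iterable("ABC", sequence_input=False)

    gen = grapheme_pipeline("abc!", grapheme_encoder=encoder)
    grapheme_list = next(gen)          # ['A', 'B', 'C'] ('!' is not a known grapheme)
    grapheme_encoded_list = next(gen)  # e.g. [0, 1, 2]
    grapheme_encoded = next(gen)       # the same indices, as a torch.Tensor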

speechbrain.lobes.models.g2p.dataio.tokenizer_encode_pipeline(seq, tokenizer, tokens, wordwise=True, word_separator=' ', token_space_index=512, char_map=None)[source]

A pipeline element that uses a pretrained tokenizer

Parameters:
  • seq (list) – List of tokens to encode.

  • tokenizer (speechbrain.tokenizer.SentencePiece) – a tokenizer instance

  • tokens (str) – available tokens

  • wordwise (bool) – whether tokenization is performed on the whole sequence or one word at a time. Whole-sequence tokenization can produce token sequences in which a single token spans multiple words

  • word_separator (str) – The substring to use as a separator between words.

  • token_space_index (int) – the index of the space token

  • char_map (dict) – a mapping from characters to tokens. This is used when tokenizing sequences of phonemes rather than sequences of characters. A sequence of phonemes is typically a list of one- or two-character tokens (e.g. ["DH", "UH", " ", "S", "AW", "N", "D"]). The character map makes it possible to map these to arbitrarily selected characters

Yields:
  • token_list (list) – a list of raw tokens

  • encoded_list (list) – a list of tokens, encoded as a list of integers

  • encoded (torch.Tensor) – a list of tokens, encoded as a tensor
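
To make char_map concrete: with a hypothetical map, a phoneme sequence collapses to a plain character string the tokenizer can handle:

    char_map = {"DH": "A", "UH": "B", "S": "C", " ": " "}  # hypothetical values
    phonemes = ["DH", "UH", " ", "S"]
    "".join(char_map[p] for p in phonemes)  # -> 'AB C'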

speechbrain.lobes.models.g2p.dataio.enable_eos_bos(tokens, encoder, bos_index, eos_index)[source]

Initializes the phoneme encoder with EOS/BOS sequences

Parameters:
  • tokens (list) – a list of tokens

  • encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance. If none is provided, a new one will be instantiated

  • bos_index (int) – the position corresponding to the Beginning-of-Sentence token

  • eos_index (int) – the position corresponding to the End-of-Sentence token

Returns:

encoder – the initialized encoder

Return type:

speechbrain.dataio.encoder.TextEncoder
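
A hypothetical usage sketch (the token inventory and indices are illustrative):

    from speechbrain.dataio.encoder import TextEncoder

    phoneme_encoder = enable_eos_bos(
        tokens=["AA", "AE", "AH"],
        encoder=TextEncoder(),
        bos_index=0,
        eos_index=1,
    )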

speechbrain.lobes.models.g2p.dataio.phoneme_pipeline(phn, phoneme_encoder=None)[source]

Encodes a sequence of phonemes using the encoder provided

Parameters:
  • phn (list) – List of phonemes

  • phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance (optional, if not provided, a new one will be created)

Yields:
  • phn (list) – the original list of phonemes

  • phn_encoded_list (list) – encoded phonemes, as a list

  • phn_encoded (torch.Tensor) – encoded phonemes, as a tensor

speechbrain.lobes.models.g2p.dataio.add_bos_eos(seq=None, encoder=None)[source]

Adds BOS and EOS tokens to the sequence provided

Parameters:
  • seq (torch.Tensor) – the source sequence

  • encoder (speechbrain.dataio.encoder.TextEncoder) – an encoder instance, used to look up the BOS/EOS token indices

Yields:
  • seq_eos (torch.Tensor) – the sequence, with the EOS token added

  • seq_bos (torch.Tensor) – the sequence, with the BOS token added
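
Conceptually, the augmentation amounts to the following (a sketch with hypothetical indices, not the exact implementation):

    import torch

    bos_index, eos_index = 0, 1
    seq = torch.tensor([5, 9, 2])
    seq_bos = torch.cat([torch.tensor([bos_index]), seq])  # tensor([0, 5, 9, 2])
    seq_eos = torch.cat([seq, torch.tensor([eos_index])])  # tensor([5, 9, 2, 1])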

speechbrain.lobes.models.g2p.dataio.beam_search_pipeline(char_lens, encoder_out, beam_searcher)[source]

Performs a Beam Search on the phonemes. This function is meant to be used as a component in a decoding pipeline

Parameters:
  • char_lens (torch.Tensor) – the length of character inputs

  • encoder_out (torch.Tensor) – Raw encoder outputs

  • beam_searcher (speechbrain.decoders.seq2seq.S2SBeamSearcher) – a SpeechBrain beam searcher instance

Returns:

  • hyps (list) – hypotheses

  • scores (list) – confidence scores associated with each hypothesis

speechbrain.lobes.models.g2p.dataio.phoneme_decoder_pipeline(hyps, phoneme_encoder)[source]

Decodes a sequence of phonemes

Parameters:
  • hyps (list) – hypotheses, the output of a beam search

  • phoneme_encoder (speechbrain.dataio.encoder.TextEncoder) – a text encoder instance

Returns:

phonemes – the phoneme sequence

Return type:

list

speechbrain.lobes.models.g2p.dataio.char_range(start_char, end_char)[source]

Produces a list of consecutive characters

Parameters:
  • start_char (str) – the starting character

  • end_char (str) – the ending character

Returns:

char_range – the character range

Return type:

list
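
A minimal sketch of the range logic (assuming an inclusive range over consecutive Unicode code points):

    def char_range(start_char, end_char):
        return [chr(idx) for idx in range(ord(start_char), ord(end_char) + 1)]

    char_range("A", "E")  # -> ['A', 'B', 'C', 'D', 'E']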

speechbrain.lobes.models.g2p.dataio.build_token_char_map(tokens)[source]

Builds a map that maps arbitrary tokens to arbitrarily chosen characters. This is required to overcome the limitations of SentencePiece.

Parameters:

tokens (list) – a list of tokens for which to produce the map

Returns:

token_map – a dictionary with original tokens as keys and new mappings as values

Return type:

dict
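
A hypothetical sketch of constructing such a map (the stand-in characters are arbitrary, as the description says):

    def build_token_char_map_sketch(tokens):
        # One single-character stand-in per multi-character token.
        return {tok: chr(ord("A") + i) for i, tok in enumerate(tokens)}

    build_token_char_map_sketch(["AA", "AE", "DH"])
    # -> {'AA': 'A', 'AE': 'B', 'DH': 'C'}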

speechbrain.lobes.models.g2p.dataio.flip_map(map_dict)[source]

Exchanges keys and values in a dictionary

Parameters:

map_dict (dict) – a dictionary

Returns:

reverse_map_dict – a dictionary with keys and values flipped

Return type:

dict
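
The behaviour is equivalent to a simple dictionary comprehension (assuming the values are unique and hashable):

    def flip_map(map_dict):
        return {value: key for key, value in map_dict.items()}

    flip_map({"AA": "A", "AE": "B"})  # -> {'A': 'AA', 'B': 'AE'}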

speechbrain.lobes.models.g2p.dataio.text_decode(seq, encoder)[source]

Decodes a sequence using a tokenizer. This function is meant to be used in hparam files

Parameters:
  • seq (torch.Tensor) – token indexes

  • encoder (sb.dataio.encoder.TextEncoder) – a text encoder instance

Returns:

output_seq – a list of lists of tokens

Return type:

list
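
A hypothetical usage sketch (the indices depend on how the encoder was populated):

    import torch
    from speechbrain.dataio.encoder import TextEncoder

    encoder = TextEncoder()
    encoder.update_from_iterable(["T", "AY", "B"], sequence_input=False)

    text_decode(torch.tensor([[0, 1, 2]]), encoder)
    # -> a list of lists of tokens, e.g. [['T', 'AY', 'B']]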

speechbrain.lobes.models.g2p.dataio.char_map_detokenize(char_map, tokenizer, token_space_index=None, wordwise=True)[source]

Returns a function that recovers the original sequence from one that has been tokenized using a character map

Parameters:
  • char_map (dict) – the character map used during tokenization (see build_token_char_map and flip_map)

  • tokenizer (speechbrain.tokenizer.SentencePiece) – a tokenizer instance

  • token_space_index (int) – the index of the space token

  • wordwise (bool) – whether to detokenize one word at a time

Returns:

f – the detokenization function

Return type:

callable
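
Conceptually, detokenization undoes the stand-in characters from the char_map example above (a sketch with hypothetical values):

    # `flipped` is a flip_map of the original character map.
    flipped = {"A": "DH", "B": "UH", "C": "S"}
    detokenized = "AB C"  # tokenizer output, recovered as characters
    [flipped.get(char, char) for char in detokenized]
    # -> ['DH', 'UH', ' ', 'S']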

class speechbrain.lobes.models.g2p.dataio.LazyInit(init)[source]

Bases: Module

A lazy initialization wrapper

Parameters:

init (callable) – The function to initialize the underlying object

__call__()[source]

Initializes the object instance, if necessary, and returns it.

to(device)[source]

Moves the underlying object to the specified device

Parameters:

device (str | torch.device) – the device

Return type:

self

speechbrain.lobes.models.g2p.dataio.lazy_init(init)[source]

A wrapper to ensure that the specified object is initialized only once (used mainly for tokenizers that train when the constructor is called)

Parameters:

init (callable) – a constructor or function that creates an object

Returns:

instance – the object instance

Return type:

object
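
A minimal sketch of the lazy-initialization pattern (illustrative only; the real LazyInit is also a torch.nn.Module, so it can be moved across devices with to()):

    class LazyInitSketch:
        def __init__(self, init):
            self.instance = None
            self.init = init

        def __call__(self):
            # Construct the object on first use only; reuse it afterwards.
            if self.instance is None:
                self.instance = self.init()
            return self.instance

    wrapper = LazyInitSketch(dict)  # stands in for an expensive constructor
    assert wrapper() is wrapper()   # the same instance on every call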

speechbrain.lobes.models.g2p.dataio.get_sequence_key(key, mode)[source]

Determines the key to be used for sequences (e.g. graphemes/phonemes) based on the naming convention

Parameters:
  • key (str) – the key (e.g. "graphemes", "phonemes")

  • mode (str) – the mode/suffix ("raw", "eos" or "bos")

Returns:

the sequence key – key if mode == "raw", otherwise f"{key}_{mode}" (e.g. "phonemes_eos")

Return type:

str
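
For example:

    get_sequence_key("phonemes", "raw")   # -> 'phonemes'
    get_sequence_key("phonemes", "eos")   # -> 'phonemes_eos'
    get_sequence_key("graphemes", "bos")  # -> 'graphemes_bos'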

speechbrain.lobes.models.g2p.dataio.phonemes_to_label(phns, decoder)[source]

Converts a batch of phoneme sequences (a single tensor) to a list of space-separated phoneme label strings (e.g. ["T AY B L", "B UH K"]), removing any special tokens

Parameters:
  • phns (torch.Tensor) – a batch of phoneme sequences

  • decoder (Callable) – Converts tensor to phoneme label strings.

Returns:

result – a list of strings corresponding to the phonemes provided

Return type:

list
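
A plausible sketch of the conversion (a hypothetical helper; decoder is assumed to map the tensor to nested phoneme labels, with specials filtered as in remove_special below):

    def phonemes_to_label_sketch(phns, decoder):
        # decoder: tensor of indices -> list of lists of phoneme labels
        labels = decoder(phns)
        return [
            " ".join(token for token in item if "<" not in token)
            for item in labels
        ]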

speechbrain.lobes.models.g2p.dataio.remove_special(phn)[source]

Removes any special tokens from the sequence. Special tokens are delimited by angle brackets.

Parameters:

phn (list) – a list of phoneme labels

Returns:

result – the original list, without any special tokens

Return type:

list
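
Given that special tokens are delimited by angle brackets, the filtering is equivalent to the following sketch:

    def remove_special(phn):
        # Drop any label containing an angle bracket, e.g. '<bos>', '<eos>'.
        return [token for token in phn if "<" not in token]

    remove_special(["<bos>", "T", "AY", "<eos>"])  # -> ['T', 'AY']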

speechbrain.lobes.models.g2p.dataio.word_emb_pipeline(txt, grapheme_encoded, grapheme_encoded_len, grapheme_encoder=None, word_emb=None, use_word_emb=None)[source]

Applies word embeddings, if applicable. This function is meant to be used as part of the encoding pipeline

Parameters:
  • txt (str) – the raw text

  • grapheme_encoded (torch.Tensor) – the encoded graphemes

  • grapheme_encoded_len (torch.Tensor) – encoded grapheme lengths

  • grapheme_encoder (speechbrain.dataio.encoder.TextEncoder) – the text encoder used for graphemes

  • word_emb (callable) – the model that produces word embeddings

  • use_word_emb (bool) – a flag indicating whether word embeddings are to be applied

Returns:

char_word_emb – Word embeddings, expanded to the character dimension

Return type:

torch.Tensor
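
Conceptually, "expanded to the character dimension" means every character receives the embedding of the word it belongs to. A sketch with hypothetical shapes (the real pipeline also handles batching and lengths):

    import torch

    txt = "THE CAT"
    word_emb = torch.randn(2, 4)  # one 4-dimensional embedding per word

    # Word index for each character (here, spaces attach to the next word).
    word_idx, indices = 0, []
    for char in txt:
        if char == " ":
            word_idx += 1
        indices.append(word_idx)

    char_word_emb = word_emb[indices]  # shape: [len(txt), 4]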