speechbrain.tokenizers.SentencePiece module

Library for Byte-pair-encoding (BPE) tokenization.

Authors

Abdelwahab Heba 2020

Loren Lugosch 2020

Summary

Classes:

SentencePiece

BPE class that calls the SentencePiece unsupervised text tokenizer from Google (https://github.com/google/sentencepiece). The SentencePiece library is an unsupervised text tokenizer and detokenizer that implements subword units such as byte-pair encoding (BPE) and the unigram language model, as well as char/word tokenizers. The constructor parameters are listed in the Reference section.

SentencePieceDecoderStreamingContext

Mutable streaming context for a single SentencePiece streaming session.

Functions:

`get_spm_tokens`	Fetch list of tokens, can be indexed by token id
`spm_decode_preserve_leading_space`	Assuming the tokenizer is sentencepiece, decodes the input hypothesis but avoids incorrectly stripping leading spaces when streaming.

Reference

class speechbrain.tokenizers.SentencePiece.SentencePiece(model_dir, vocab_size, annotation_train=None, annotation_read=None, model_type='unigram', char_format_input=False, character_coverage=1.0, user_defined_symbols=None, max_sentencepiece_length=10, bos_id=-1, eos_id=-1, pad_id=-1, unk_id=0, split_by_whitespace=True, num_sequences=None, annotation_list_to_check=None, annotation_format='csv', text_file=None, add_dummy_prefix=True)[source]

Bases: object

BPE class that calls the SentencePiece unsupervised text tokenizer from Google. Reference: https://github.com/google/sentencepiece

The SentencePiece library is an unsupervised text tokenizer and detokenizer. It implements subword units such as byte-pair encoding (BPE) and the unigram language model, as well as char/word tokenizers.

Parameters:

model_dir (str) – The directory where the model will be saved (or is already stored).
vocab_size (int, None, optional) – Vocabulary size for the chosen tokenizer type (BPE, unigram). The vocab_size is optional for char, and mandatory for BPE and unigram tokenization.
annotation_train (str) – Path of the annotation file which is used to learn the tokenizer. It can be in JSON or csv format.
annotation_read (str) – The data entry which contains the word sequence in the annotation file.
model_type (str) – Tokenizer type (bpe, char, unigram). If "bpe", train unsupervised tokenization of word pieces, see: https://www.aclweb.org/anthology/P16-1162/ If "word", take the vocabulary from the input text. If "unigram", do word-piece tokenization using a unigram language model, see: https://arxiv.org/abs/1804.10959 (default: "unigram")
char_format_input (bool) – Whether the read entry is given in character format (e.g., a p p l e _ i s _ g o o d). (default: False)
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, like Japanese or Chinese, and 1.0 for other languages with a small character set. (default: 1.0)
user_defined_symbols (str) – String containing a list of symbols separated by commas. User-defined symbols are handled as one piece in any context. (default: None)
max_sentencepiece_length (int) – Maximum number of characters for the tokens. (default: 10)
bos_id (int) – If -1, bos_id = unk_id = 0; otherwise, bos_id is set to the given integer. (default: -1)
eos_id (int) – If -1, eos_id = unk_id = 0; otherwise, eos_id is set to the given integer. (default: -1)
pad_id (int) – If -1, pad_id = unk_id = 0; otherwise, pad_id is set to the given integer. (default: -1)
unk_id (int) – The token id corresponding to an unknown symbol (not in the token set). (default: 0)
split_by_whitespace (bool) – If False, allow SentencePiece to extract pieces that cross multiple words. This feature is important for Chinese, Japanese, and Korean. (default: True)
num_sequences (int) – If not None, use at most this many sequences to train the tokenizer (for large datasets). (default: None)
annotation_list_to_check (list) – List of annotation files used to check the accuracy of recovering words from the tokenizer. (default: None)
annotation_format (str) – The format of the annotation file. JSON and csv are the supported formats. (default: "csv")
text_file (str) – An alternate path to the text file (needed when multiple models are trained on the same data file)
add_dummy_prefix (bool) – If True the tokenizer adds dummy whitespace at the beginning of text. (default: True)

Example

>>> import torch
>>> from speechbrain.tokenizers.SentencePiece import SentencePiece
>>> dict_int2lab = {1: "HELLO", 2: "MORNING"}
>>> model_dir = getfixture("tmpdir") / "tokenizer_data"
>>> # Example with csv
>>> annotation_train = "tests/samples/annotation/dev-clean.csv"
>>> annotation_read = "wrd"
>>> model_type = "bpe"
>>> bpe = SentencePiece(
...     str(model_dir), 100, annotation_train, annotation_read, model_type
... )
>>> batch_seq = torch.Tensor([[1, 2, 2, 1], [1, 2, 1, 0]])
>>> batch_lens = torch.Tensor([1.0, 0.75])
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
>>> # Example using JSON
>>> annotation_train = str(model_dir + "/dev-clean.json")
>>> annotation_read = "wrd"
>>> bpe = SentencePiece(
...     model_dir,
...     100,
...     annotation_train,
...     annotation_read,
...     model_type,
...     annotation_format="json",
... )
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
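The `char_format_input` option refers to annotation entries written one character at a time, with `_` marking word boundaries. The sketch below shows how such an entry corresponds to an ordinary word sequence. The helper name `chars_to_words` is hypothetical and not part of SpeechBrain, whose internal handling may differ.

```python
def chars_to_words(entry: str) -> str:
    """Collapse a space-separated character sequence into words.

    Hypothetical helper for illustration only (not a SpeechBrain function).
    Assumes "_" marks a word boundary, as in the documented example.
    """
    chars = entry.split()  # ["a", "p", "p", "l", "e", "_", ...]
    return "".join(chars).replace("_", " ")


print(chars_to_words("a p p l e _ i s _ g o o d"))  # apple is good
```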

__call__(batch, batch_lens=None, ind2lab=None, task='encode')[source]

Implements the tokenizer encoder and decoder (restoring the string of words) for BPE, regularized BPE (with unigram), and char (speechbrain/nnet/RNN.py).

Parameters:

batch (tensor or list) – Contains the original labels. Shape: [batch_size, max_length]. A list when batch_lens is None and task is "decode_from_list".
batch_lens (tensor.LongTensor) – Contains the relative length of each label sequence. Must be a 1D tensor of shape [batch_size]. (default: None)
ind2lab (dict) – Dictionary that maps the indices in the label sequences (batch tensor) to string labels.
task (str) – One of "encode", "decode", or "decode_from_list":
"encode": convert the batch tensor into sequences of tokens. The output contains a list of (tokens_seq, tokens_lens).
"decode": convert a tensor of tokens to a list of word sequences.
"decode_from_list": convert a list of token sequences to a list of word sequences.
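To make the roles of `batch`, `batch_lens`, and `ind2lab` in the "encode" task concrete, here is a minimal sketch of how a padded label batch with relative lengths could be turned into word strings before tokenization. It uses plain Python lists instead of tensors. The function name `labels_to_sentences` is hypothetical, and the rounding of `relative_length * max_length` is an assumption rather than a statement about the library's internals.

```python
from typing import Dict, List, Sequence


def labels_to_sentences(
    batch: Sequence[Sequence[int]],
    batch_lens: Sequence[float],
    ind2lab: Dict[int, str],
) -> List[str]:
    """Map padded label indices to space-joined word strings.

    Illustrative only: assumes each relative length is a fraction of the
    padded row length and that rounding recovers the true length.
    """
    max_len = max((len(row) for row in batch), default=0)
    sentences = []
    for row, rel_len in zip(batch, batch_lens):
        abs_len = int(round(rel_len * max_len))  # drop the padding
        words = [ind2lab[int(idx)] for idx in row[:abs_len]]
        sentences.append(" ".join(words))
    return sentences


dict_int2lab = {1: "HELLO", 2: "MORNING"}
batch_seq = [[1, 2, 2, 1], [1, 2, 1, 0]]
batch_lens = [1.0, 0.75]
print(labels_to_sentences(batch_seq, batch_lens, dict_int2lab))
```

With the values from the class example, the second row keeps only its first three labels, so the padding index 0 never needs an entry in `ind2lab`.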

speechbrain.tokenizers.SentencePiece.get_spm_tokens(model_path)[source]

Fetch list of tokens, can be indexed by token id

The resulting list can be used to map id to token.

Parameters:

model_path (str) – Path to SentencePiece model

Returns:

Tokens in order by id (can be indexed by id)

Return type:

list
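The idea of a token list indexed by id can be sketched as follows. The real function takes a model path and loads the model itself. This sketch instead accepts any object exposing `get_piece_size()` and `id_to_piece()`, two methods offered by `sentencepiece.SentencePieceProcessor`. `ToyProcessor` and `tokens_by_id` are hypothetical names used only for illustration, and it is an assumption that the library builds its list in this way.

```python
from typing import List


class ToyProcessor:
    """Hypothetical stand-in mimicking two SentencePieceProcessor methods."""

    def __init__(self, pieces: List[str]):
        self._pieces = list(pieces)

    def get_piece_size(self) -> int:
        return len(self._pieces)

    def id_to_piece(self, piece_id: int) -> str:
        return self._pieces[piece_id]


def tokens_by_id(processor) -> List[str]:
    """Return all pieces ordered by id, so that result[i] is the piece of id i."""
    return [processor.id_to_piece(i) for i in range(processor.get_piece_size())]


toy = ToyProcessor(["<unk>", "\u2581how", "\u2581are", "\u2581you"])
tokens = tokens_by_id(toy)
print(tokens[2])  # piece whose id is 2
```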

class speechbrain.tokenizers.SentencePiece.SentencePieceDecoderStreamingContext(emitted_symbol_count: int = 0)[source]

Bases: object

Mutable streaming context for a single SentencePiece streaming session.

emitted_symbol_count: int = 0

The number of symbols that have been emitted for this transcription.

speechbrain.tokenizers.SentencePiece.spm_decode_preserve_leading_space(tokenizer: SentencePieceProcessor, hyps: List[int], context: SentencePieceDecoderStreamingContext) → List[str][source]

Assuming the tokenizer is sentencepiece, decodes the input hypothesis but avoids incorrectly stripping leading spaces when streaming. Operates on a single hypothesis, not a batch of hypotheses.

Normally, the tokenizer always decodes full sentences at a time, with the consequence that the first space in decoding will get removed. However, when streaming, we might be decoding mid-utterance where spaces must not be removed mid-sentence. This function handles this case.

For example, if within the same streaming context you decode ["▁how", "▁are"] and then ["▁you"], the decoder would normally return "how areyou", whereas this function returns "how are you".

Parameters:

tokenizer (sentencepiece.SentencePieceProcessor) – The SentencePiece processor to use for decoding.
hyps (list of output token hypotheses) – List of tokens to decode, of any length >= 0.
context (SentencePieceDecoderStreamingContext) – Mutable streaming context for the sentencepiece decoder, which should be reused across calls for the same decoding stream.

Returns:

Decoded text. Leading spaces are preserved, except at the start of a transcription.

Return type:

str
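The behaviour described above can be sketched without a trained model. The example below works on piece strings rather than token ids. `toy_decode` is a hypothetical stand-in that mimics a full-sentence decoder by dropping one leading space, and `StreamingContext` mirrors the documented `emitted_symbol_count` field. It illustrates the idea under these assumptions and is not the library's implementation.

```python
from dataclasses import dataclass
from typing import List

SPACE_MARK = "\u2581"  # the "▁" symbol SentencePiece uses to mark spaces


@dataclass
class StreamingContext:
    """Toy counterpart of SentencePieceDecoderStreamingContext."""

    emitted_symbol_count: int = 0


def toy_decode(pieces: List[str]) -> str:
    """Stand-in decoder: joins pieces and drops one leading space."""
    text = "".join(pieces).replace(SPACE_MARK, " ")
    return text[1:] if text.startswith(" ") else text


def decode_preserve_leading_space(
    pieces: List[str], context: StreamingContext
) -> str:
    """Decode one chunk, restoring the leading space when mid-stream."""
    text = toy_decode(pieces)
    if pieces:
        mid_stream = context.emitted_symbol_count > 0
        if mid_stream and pieces[0].startswith(SPACE_MARK):
            # The decoder removed a space that belongs mid-sentence.
            text = " " + text
        context.emitted_symbol_count += len(pieces)
    return text


ctx = StreamingContext()
chunks = [[SPACE_MARK + "how", SPACE_MARK + "are"], [SPACE_MARK + "you"]]
print("".join(decode_preserve_leading_space(c, ctx) for c in chunks))
# how are you
```

Reusing the same context object across calls is what lets the second chunk know it is not at the start of the transcription.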