speechbrain.tokenizers.SentencePiece module

Library for Byte-pair-encoding (BPE) tokenization.

Authors

  • Abdelwahab Heba 2020

  • Loren Lugosch 2020

Summary

Classes:

SentencePiece

BPE class that calls the SentencePiece unsupervised text tokenizer from Google.

Reference

class speechbrain.tokenizers.SentencePiece.SentencePiece(model_dir, vocab_size, annotation_train=None, annotation_read=None, model_type='unigram', char_format_input=False, character_coverage=1.0, user_defined_symbols=None, max_sentencepiece_length=10, bos_id=-1, eos_id=-1, pad_id=-1, unk_id=0, split_by_whitespace=True, num_sequences=None, annotation_list_to_check=None, annotation_format='csv')

Bases: object

BPE class that calls the SentencePiece unsupervised text tokenizer from Google. Reference: https://github.com/google/sentencepiece

The SentencePiece library is an unsupervised text tokenizer and detokenizer. It implements subword units such as byte-pair encoding (BPE) and the unigram language model, as well as char/word tokenizers.

Parameters
  • model_dir (str) – The directory where the model will be saved (or is already stored).

  • vocab_size (int, None, optional) – Vocabulary size for the chosen tokenizer type (BPE, unigram). The vocab_size is optional for char, and mandatory for BPE and unigram tokenization.

  • annotation_train (str) – Path of the annotation file used to train the tokenizer. It can be in JSON or csv format.

  • annotation_read (str) – The data entry which contains the word sequence in the annotation file.

  • model_type (str) – One of “bpe”, “char”, or “unigram”. If “bpe”, train an unsupervised subword (word piece) tokenizer, see: https://www.aclweb.org/anthology/P16-1162/ If “word”, take the vocabulary from the input text. If “unigram”, perform subword tokenization using a unigram language model, see: https://arxiv.org/abs/1804.10959

  • char_format_input (bool) – Whether the read entry is in character format (e.g., a p p l e _ i s _ g o o d). (default: False)

  • character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for languages with a small character set. (default: 1.0)

  • user_defined_symbols (str) – String containing a list of symbols separated by commas. User-defined symbols are handled as one piece in any context. (default: None)

  • max_sentencepiece_length (int) – Maximum number of characters for the tokens. (default: 10)

  • bos_id (int) – If -1, bos_id = unk_id = 0; otherwise, bos_id = int. (default: -1)

  • eos_id (int) – If -1, eos_id = unk_id = 0; otherwise, eos_id = int. (default: -1)

  • split_by_whitespace (bool) – If False, allow SentencePiece to extract pieces that cross word boundaries. This is important for Chinese, Japanese, and Korean. (default: True)

  • num_sequences (int) – If not None, use at most this many sequences to train the tokenizer (for large datasets). (default: None)

  • annotation_list_to_check (list) – List of annotation files used to check the accuracy of recovering words from the tokenizer.

  • annotation_format (str) – The format of the annotation file. JSON or csv are the formats supported.
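
Most of the constructor arguments above share their names with SentencePiece trainer options. The following is a minimal sketch, not SpeechBrain's actual implementation, of how such settings could be assembled into a SentencePiece flag string. The helper name `build_trainer_flags`, the file names, and the validation rules are assumptions made for illustration; the flag names are standard SentencePiece trainer flags.

```python
def build_trainer_flags(
    text_file,
    model_prefix,
    vocab_size=None,
    model_type="unigram",
    character_coverage=1.0,
    user_defined_symbols=None,
    max_sentencepiece_length=10,
    bos_id=-1,
    eos_id=-1,
    pad_id=-1,
    unk_id=0,
    split_by_whitespace=True,
):
    """Hypothetical helper: format tokenizer settings as SentencePiece flags."""
    # Assumption: only the three model types listed in the docs are accepted.
    if model_type not in ("bpe", "char", "unigram"):
        raise ValueError("model_type must be one of: bpe, char, unigram")
    # vocab_size is optional for char, mandatory for bpe and unigram.
    if vocab_size is None and model_type != "char":
        raise ValueError("vocab_size is mandatory for bpe and unigram")

    flags = [
        f"--input={text_file}",
        f"--model_prefix={model_prefix}",
        f"--model_type={model_type}",
        f"--character_coverage={character_coverage}",
        f"--max_sentencepiece_length={max_sentencepiece_length}",
        f"--bos_id={bos_id}",
        f"--eos_id={eos_id}",
        f"--pad_id={pad_id}",
        f"--unk_id={unk_id}",
        f"--split_by_whitespace={str(split_by_whitespace).lower()}",
    ]
    if vocab_size is not None:
        flags.append(f"--vocab_size={vocab_size}")
    if user_defined_symbols is not None:
        # Already a comma-separated string, e.g. "<sil>,<noise>".
        flags.append(f"--user_defined_symbols={user_defined_symbols}")
    return " ".join(flags)


# Hypothetical file names, chosen only for this example.
flags = build_trainer_flags(
    "train.txt", "tokenizer_data/100_bpe", vocab_size=100, model_type="bpe"
)
print(flags)
```

A string of this shape is the kind of argument that `sentencepiece.SentencePieceTrainer.train` accepts, although whether the class builds it exactly this way is not confirmed by this page.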

Example

>>> import torch
>>> from speechbrain.tokenizers.SentencePiece import SentencePiece
>>> # Map label indices to words
>>> dict_int2lab = {1: "HELLO", 2: "MORNING"}
>>> model_dir = "tests/unittests/tokenizer_data/"
>>> # Example with csv
>>> annotation_train = "tests/unittests/tokenizer_data/dev-clean.csv"
>>> annotation_read = "wrd"
>>> model_type = "bpe"
>>> bpe = SentencePiece(model_dir, 100, annotation_train, annotation_read,
...                     model_type)
>>> # The trailing 0 in the second row is excluded by its relative length of 0.75
>>> batch_seq = torch.Tensor([[1, 2, 2, 1], [1, 2, 1, 0]])
>>> batch_lens = torch.Tensor([1.0, 0.75])
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
>>> # Example using JSON
>>> annotation_train = "tests/unittests/tokenizer_data/dev-clean.json"
>>> annotation_read = "wrd"
>>> bpe = SentencePiece(model_dir, 100, annotation_train, annotation_read,
...                     model_type, annotation_format="json")
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
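
To make the roles of annotation_read, annotation_format, char_format_input, and num_sequences more concrete, here is a minimal standard-library sketch of how training text might be collected from an annotation file. It is not the class's own code: the helper names `merge_chars` and `read_word_sequences` are hypothetical, and the file layouts (a csv file with a header row, a JSON object mapping utterance IDs to dictionaries of fields) are assumptions.

```python
import csv
import io
import json


def merge_chars(entry):
    """Turn 'a p p l e _ i s _ g o o d' into 'apple is good'."""
    return " ".join("".join(entry.split()).split("_"))


def read_word_sequences(
    handle,
    annotation_read,
    annotation_format="csv",
    char_format_input=False,
    num_sequences=None,
):
    """Hypothetical reader: collect the text entries used to train a tokenizer."""
    if annotation_format == "csv":
        # Assumed layout: header row, one column named by annotation_read.
        entries = (row[annotation_read] for row in csv.DictReader(handle))
    elif annotation_format == "json":
        # Assumed layout: {utterance_id: {field_name: value, ...}, ...}
        entries = (item[annotation_read] for item in json.load(handle).values())
    else:
        raise ValueError("annotation_format must be 'csv' or 'json'")

    sequences = []
    for entry in entries:
        # Stop early when a cap on the number of sequences is requested.
        if num_sequences is not None and len(sequences) >= num_sequences:
            break
        sequences.append(merge_chars(entry) if char_format_input else entry)
    return sequences


csv_text = "ID,duration,wrd\nutt1,1.2,HELLO MORNING\nutt2,0.9,MORNING HELLO\n"
json_text = json.dumps({"utt1": {"wrd": "h e l l o _ m o r n i n g"}})

print(read_word_sequences(io.StringIO(csv_text), "wrd"))
# ['HELLO MORNING', 'MORNING HELLO']
print(
    read_word_sequences(
        io.StringIO(json_text), "wrd", "json", char_format_input=True
    )
)
# ['hello morning']
```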

__call__(batch, batch_lens=None, ind2lab=None, task='encode')

This __call__ function implements the tokenizer encoder and decoder (restoring the string of words) for BPE, regularized BPE (with unigram), and char (speechbrain/nnet/RNN.py).

Parameters
  • batch (tensor or list) – Contains the original labels. Shape: [batch_size, max_length]. A list if batch_lens = None and task = “decode_from_list”.

  • batch_lens (tensor.LongTensor) – Contains the relative length of each label sequence. Must be a 1D tensor of shape [batch_size]. (default: None)

  • ind2lab (dict) – Dictionary that maps each index in the label sequences (batch tensor) to its string label.

  • task (str) – One of “encode”, “decode”, or “decode_from_list”.

    “encode”: convert the batch tensor into sequences of tokens. The output contains a list of (tokens_seq, tokens_lens).

    “decode”: convert a tensor of tokens to a list of word sequences.

    “decode_from_list”: convert a list of token sequences to a list of word sequences.
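
The interplay of batch, batch_lens, and ind2lab can be illustrated with a small sketch. It uses plain Python lists instead of tensors and a hypothetical helper, `labels_to_words`, so it only approximates the preprocessing that “encode” presumably performs before tokenization: each relative length is scaled by max_length to find the valid labels, which are then mapped to strings. Rounding the scaled length is an assumption.

```python
def labels_to_words(batch, batch_lens, ind2lab):
    """Hypothetical helper: trim padded label rows and map indices to strings."""
    max_length = len(batch[0])
    word_seqs = []
    for row, rel_len in zip(batch, batch_lens):
        # Relative length -> absolute number of valid labels in this row.
        abs_len = int(round(rel_len * max_length))
        word_seqs.append([ind2lab[int(index)] for index in row[:abs_len]])
    return word_seqs


# Same values as in the class example, written as plain lists.
dict_int2lab = {1: "HELLO", 2: "MORNING"}
batch_seq = [[1, 2, 2, 1], [1, 2, 1, 0]]
batch_lens = [1.0, 0.75]

print(labels_to_words(batch_seq, batch_lens, dict_int2lab))
# [['HELLO', 'MORNING', 'MORNING', 'HELLO'], ['HELLO', 'MORNING', 'HELLO']]
```

The second row keeps only three labels because 0.75 × 4 = 3, so the trailing 0, which has no entry in ind2lab, is never looked up.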