speechbrain.tokenizers.SentencePiece module

Library for Byte-pair-encoding (BPE) tokenization.

Authors
  • Abdelwahab Heba 2020

  • Loren Lugosch 2020

Summary

Classes:

SentencePiece

BPE class that calls the SentencePiece unsupervised text tokenizer from Google.

Reference

class speechbrain.tokenizers.SentencePiece.SentencePiece(model_dir, vocab_size, annotation_train=None, annotation_read=None, model_type='unigram', char_format_input=False, character_coverage=1.0, user_defined_symbols=None, max_sentencepiece_length=10, bos_id=-1, eos_id=-1, pad_id=-1, unk_id=0, split_by_whitespace=True, num_sequences=None, annotation_list_to_check=None, annotation_format='csv')[source]

Bases: object

BPE class that calls the SentencePiece unsupervised text tokenizer from Google.

Reference: https://github.com/google/sentencepiece

The SentencePiece library is an unsupervised text tokenizer and detokenizer. It implements subword units such as Byte-pair-encoding (BPE) and the unigram language model, as well as char/word tokenization.
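For intuition about what the BPE variant learns, here is a minimal pure-Python sketch of the iterative pair-merge loop from Sennrich et al. This is an illustration only, not the library's implementation (the actual tokenizer is the C++ SentencePiece engine):

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from character-level symbols; "</w>" marks the end of a word.
    vocab = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the working vocabulary.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

merges = learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
# On this toy corpus, the first learned merge is ('e', 's').
```

Each merge becomes a new token in the subword vocabulary; the trained model later applies these merges greedily when encoding text.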

Parameters
  • model_dir (str) – The directory where the model will be saved (or already stored).

  • vocab_size (int, None, optional) – Vocabulary size for the chosen tokenizer type (BPE, unigram). The vocab_size is optional for char tokenization and mandatory for BPE and unigram tokenization.

  • annotation_train (str) – Path of the annotation file which is used to learn the tokenizer. It can be in JSON or csv format.

  • annotation_read (str) – The data entry which contains the word sequence in the annotation file.

  • model_type (str) – One of “bpe”, “char”, “unigram”. If “bpe”, train an unsupervised tokenization over pieces of words (see https://www.aclweb.org/anthology/P16-1162/). If “word”, take the vocabulary from the input text. If “unigram”, perform piece-of-word tokenization using a unigram language model (see https://arxiv.org/abs/1804.10959).

  • char_format_input (bool) – Whether the read entry is already in character format (e.g., “a p p l e _ i s _ g o o d”). (default: False)

  • character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, such as Japanese or Chinese, and 1.0 for other languages with a small character set. (default: 1.0)

  • user_defined_symbols (str) – String containing a comma-separated list of symbols. User-defined symbols are handled as one piece in any context. (default: None)

  • max_sentencepiece_length (int) – Maximum number of characters for the tokens. (default: 10)

  • bos_id (int) – Index of the beginning-of-sentence token. If -1, bos_id = unk_id = 0; otherwise the given integer is used. (default: -1)

  • eos_id (int) – Index of the end-of-sentence token. If -1, eos_id = unk_id = 0; otherwise the given integer is used. (default: -1)

  • pad_id (int) – Index of the padding token. (default: -1)

  • unk_id (int) – Index of the unknown token. (default: 0)

  • split_by_whitespace (bool) – If False, allow the tokenizer to extract pieces crossing word boundaries. This feature is important for Chinese/Japanese/Korean. (default: True)

  • num_sequences (int) – If not None, use at most this many sequences to train the tokenizer (useful for large datasets). (default: None)

  • annotation_list_to_check (list) – List of annotation files used to check the accuracy of recovering words from the tokenizer.

  • annotation_format (str) – The format of the annotation file. JSON and csv are the supported formats.
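For intuition, the character format expected when char_format_input=True can be produced from a plain word string as follows (an illustrative helper, not part of the class):

```python
def to_char_format(text):
    # "apple is good" -> "a p p l e _ i s _ g o o d"
    # Each character becomes its own space-separated token, and the
    # spaces between words become "_" tokens.
    return " ".join("_" if ch == " " else ch for ch in text)

chars = to_char_format("apple is good")
```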

Example

>>> import torch
>>> dict_int2lab = {1: "HELLO", 2: "MORNING"}
>>> model_dir = "tests/unittests/tokenizer_data/"
>>> # Example with csv
>>> annotation_train = "tests/unittests/tokenizer_data/dev-clean.csv"
>>> annotation_read = "wrd"
>>> model_type = "bpe"
>>> bpe = SentencePiece(model_dir, 100, annotation_train, annotation_read,
...                     model_type)
>>> batch_seq = torch.Tensor([[1, 2, 2, 1],[1, 2, 1, 0]])
>>> batch_lens = torch.Tensor([1.0, 0.75])
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
>>> # Example using JSON
>>> annotation_train = "tests/unittests/tokenizer_data/dev-clean.json"
>>> annotation_read = "wrd"
>>> bpe = SentencePiece(model_dir, 100, annotation_train, annotation_read,
...                     model_type, annotation_format='json')
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
__call__(batch, batch_lens=None, ind2lab=None, task='encode')[source]

This __call__ function implements the tokenizer encoder and decoder (restoring the string of words) for BPE, regularized BPE (with unigram), and char (speechbrain/nnet/RNN.py).

Parameters
  • batch (tensor.IntTensor or list) – Contains the original labels. A list if batch_lens is None and task is “decode_from_list”; otherwise a tensor of shape [batch_size, max_length].

  • batch_lens (tensor.LongTensor) – Contains the relative length of each label sequence. Must be a 1D tensor of shape [batch_size]. (default: None)

  • ind2lab (dict) – Dictionary mapping indices from the label sequences (batch tensor) to string labels.

  • task (str) – One of “encode”, “decode”, “decode_from_list”.

    “encode”: convert the batch tensor into sequences of tokens; the output contains a list of (tokens_seq, tokens_lens).

    “decode”: convert a tensor of tokens to a list of word sequences.

    “decode_from_list”: convert a list of token sequences to a list of word sequences.
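For intuition on the decode tasks: SentencePiece marks word boundaries with the “▁” character (U+2581), so restoring a word sequence from subword pieces amounts to concatenating the pieces and re-splitting on that marker. The following is an illustrative sketch of that convention, not the class's internal code:

```python
def pieces_to_words(pieces):
    # Concatenate subword pieces, then turn the "▁" (U+2581) word-start
    # markers back into spaces to recover the word sequence.
    return "".join(pieces).replace("\u2581", " ").strip().split()

words = pieces_to_words(["\u2581HE", "LLO", "\u2581MOR", "NING"])
# A "decode_from_list"-style result: ["HELLO", "MORNING"]
```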