speechbrain.tokenizers.SentencePiece module
Library for Byte-pair-encoding (BPE) tokenization.
Authors
Abdelwahab Heba 2020
Loren Lugosch 2020
Summary
Classes:
SentencePiece: wrapper class that calls the SentencePiece unsupervised text tokenizer and detokenizer from Google. Reference: https://github.com/google/sentencepiece
Reference
- class speechbrain.tokenizers.SentencePiece.SentencePiece(model_dir, vocab_size, annotation_train=None, annotation_read=None, model_type='unigram', char_format_input=False, character_coverage=1.0, user_defined_symbols=None, max_sentencepiece_length=10, bos_id=-1, eos_id=-1, pad_id=-1, unk_id=0, split_by_whitespace=True, num_sequences=None, annotation_list_to_check=None, annotation_format='csv', text_file=None, add_dummy_prefix=True)[source]
Bases: object
Wrapper class that calls the SentencePiece unsupervised text tokenizer from Google. Reference: https://github.com/google/sentencepiece
SentencePiece is an unsupervised text tokenizer and detokenizer. It implements subword units such as Byte-pair-encoding (BPE) and the unigram language model, as well as char/word tokenization.
- Parameters:
model_dir (str) – The directory where the model will be saved (or is already stored).
vocab_size (int, None, optional) – Vocabulary size for the chosen tokenizer type (BPE, Unigram). The vocab_size is optional for char, and mandatory for BPE and unigram tokenization.
annotation_train (str) – Path of the annotation file used to train the tokenizer. It can be in JSON or csv format.
annotation_read (str) – The data entry in the annotation file that contains the word sequence.
model_type (str) – One of (bpe, char, unigram). If “bpe”, train unsupervised tokenization into word pieces, see: https://www.aclweb.org/anthology/P16-1162/ If “word”, take the vocabulary from the input text. If “unigram”, do word-piece tokenization using a unigram language model, see: https://arxiv.org/abs/1804.10959 (default: “unigram”)
char_format_input (bool) – Whether the entry being read is in character format (e.g., a p p l e _ i s _ g o o d). (default: False)
character_coverage (float) – Fraction of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, like Japanese or Chinese, and 1.0 for languages with a small character set. (default: 1.0)
user_defined_symbols (str) – String containing a list of symbols separated by commas. User-defined symbols are handled as one piece in any context. (default: None)
max_sentencepiece_length (int) – Maximum number of characters for the tokens. (default: 10)
bos_id (int) – If -1, bos_id = unk_id = 0; otherwise, bos_id is the given integer. (default: -1)
eos_id (int) – If -1, eos_id = unk_id = 0; otherwise, eos_id is the given integer. (default: -1)
split_by_whitespace (bool) – If False, allow SentencePiece to extract pieces that cross multiple words. This feature is important for Chinese, Japanese, and Korean. (default: True)
num_sequences (int) – If not None, use at most this many sequences to train the tokenizer (for large datasets). (default: None)
annotation_list_to_check (list) – List of annotation files used to check how accurately words are recovered from the tokenizer.
annotation_format (str) – The format of the annotation file. JSON and csv are the supported formats. (default: “csv”)
text_file (str) – An alternate path to the text file (needed when multiple models are trained on the same data file).
add_dummy_prefix (bool) – If True, the tokenizer adds a dummy whitespace at the beginning of the text. (default: True)
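To make the parameter list above more concrete, the following is a minimal sketch of how such constructor arguments could be turned into a flag string for the sentencepiece trainer. The flag names are those of the sentencepiece command-line trainer, but the helper `build_spm_args`, its `model_prefix` argument, and the exact mapping are assumptions made for illustration; this is not SpeechBrain's implementation.

```python
def build_spm_args(
    text_file,
    model_prefix,
    vocab_size,
    model_type="unigram",
    character_coverage=1.0,
    max_sentencepiece_length=10,
    bos_id=-1,
    eos_id=-1,
    pad_id=-1,
    unk_id=0,
    split_by_whitespace=True,
    user_defined_symbols=None,
    add_dummy_prefix=True,
):
    """Hypothetical helper: build a sentencepiece trainer flag string."""
    # Per the parameter list, vocab_size may be omitted only for "char".
    if vocab_size is None and model_type != "char":
        raise ValueError("vocab_size is mandatory for bpe and unigram")

    flags = [
        f"--input={text_file}",
        f"--model_prefix={model_prefix}",
        f"--model_type={model_type}",
        f"--character_coverage={character_coverage}",
        f"--max_sentencepiece_length={max_sentencepiece_length}",
        f"--bos_id={bos_id}",
        f"--eos_id={eos_id}",
        f"--pad_id={pad_id}",
        f"--unk_id={unk_id}",
        # Boolean flags are written in lowercase ("true" / "false").
        f"--split_by_whitespace={str(split_by_whitespace).lower()}",
        f"--add_dummy_prefix={str(add_dummy_prefix).lower()}",
    ]
    # Optional flags are only emitted when a value was given.
    if vocab_size is not None:
        flags.append(f"--vocab_size={vocab_size}")
    if user_defined_symbols is not None:
        flags.append(f"--user_defined_symbols={user_defined_symbols}")
    return " ".join(flags)


# Illustrative file names only; they do not refer to real files.
example_args = build_spm_args(
    "train.txt", "tokenizer_data/100_bpe", 100, model_type="bpe"
)
print(example_args)
```

A string of this shape is the kind of input accepted by sentencepiece's `SentencePieceTrainer.train`; whether the class assembles its arguments exactly this way is not stated in this page.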
Example
>>> import torch
>>> from speechbrain.tokenizers.SentencePiece import SentencePiece
>>> dict_int2lab = {1: "HELLO", 2: "MORNING"}
>>> model_dir = getfixture('tmpdir') / "tokenizer_data"
>>> # Example with csv
>>> annotation_train = "tests/samples/annotation/dev-clean.csv"
>>> annotation_read = "wrd"
>>> model_type = "bpe"
>>> bpe = SentencePiece(str(model_dir), 100, annotation_train, annotation_read, model_type)
>>> batch_seq = torch.Tensor([[1, 2, 2, 1], [1, 2, 1, 0]])
>>> batch_lens = torch.Tensor([1.0, 0.75])
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
>>> # Example using JSON
>>> annotation_train = str(model_dir + "/dev-clean.json")
>>> annotation_read = "wrd"
>>> bpe = SentencePiece(model_dir, 100, annotation_train, annotation_read, model_type, annotation_format="json")
>>> encoded_seq_ids, encoded_seq_pieces = bpe(
...     batch_seq, batch_lens, dict_int2lab, task="encode"
... )
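For annotation entries stored in character format (the char_format_input option, e.g. a p p l e _ i s _ g o o d), the entry has to be merged back into words before it can serve as training text. The sketch below shows one way this could be done; the function name `char_format_to_words` is hypothetical, and treating the underscore as the word separator is an assumption read off the example in the parameter description.

```python
def char_format_to_words(entry):
    """Hypothetical helper: merge a character-format entry into words.

    Assumes characters are separated by spaces and that "_" marks a
    word boundary, as in "a p p l e _ i s _ g o o d".
    """
    chars = entry.split()          # ["a", "p", "p", "l", "e", "_", ...]
    merged = "".join(chars)        # "apple_is_good"
    return merged.replace("_", " ")


print(char_format_to_words("a p p l e _ i s _ g o o d"))
```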
- __call__(batch, batch_lens=None, ind2lab=None, task='encode')[source]
This __call__ function implements the tokenizer encoder and decoder (restoring the string of words) for BPE, Regularized BPE (with unigram), and char (speechbrain/nnet/RNN.py).
- Parameters:
batch (tensor or list) – Contains the original labels. Shape: [batch_size, max_length]. A list if batch_lens = None and task = “decode_from_list”.
batch_lens (tensor.LongTensor) – Contains the relative length of each label sequence. Must be a 1D tensor of shape: [batch_size]. (default: None)
ind2lab (dict) – Dictionary which maps the indices from the label sequences (batch tensor) to string labels.
task (str) – One of (“encode”, “decode”, “decode_from_list”). (default: “encode”)
“encode”: convert the batch tensor into sequences of tokens. The output contains a list of (tokens_seq, tokens_lens).
“decode”: convert a tensor of tokens to a list of word sequences.
“decode_from_list”: convert a list of token sequences to a list of word sequences.
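The data handling around the tokenizer can be illustrated without torch or sentencepiece. The sketch below shows, under stated assumptions, how relative lengths and ind2lab could turn a padded label batch into word strings before encoding, and how a list of SentencePiece pieces could be merged back into words when decoding. Both function names are hypothetical, plain lists stand in for tensors, and rounding `relative_length * max_length` to get the absolute length is an assumption. The "▁" (U+2581) word-boundary marker is the one SentencePiece uses in its pieces.

```python
def labels_to_words(batch, batch_lens, ind2lab):
    """Hypothetical helper: map padded label indices to word strings."""
    max_len = len(batch[0])
    sentences = []
    for seq, rel_len in zip(batch, batch_lens):
        # Assumption: absolute length = round(relative length * max length).
        abs_len = int(round(rel_len * max_len))
        words = [ind2lab[int(idx)] for idx in seq[:abs_len]]
        sentences.append(" ".join(words))
    return sentences


def pieces_to_words(pieces):
    """Hypothetical helper: merge SentencePiece pieces into a word list."""
    # "\u2581" is the marker SentencePiece places at the start of a word.
    return "".join(pieces).replace("\u2581", " ").split()


# Same toy batch as in the class example, written as plain lists.
dict_int2lab = {1: "HELLO", 2: "MORNING"}
batch_seq = [[1, 2, 2, 1], [1, 2, 1, 0]]
batch_lens = [1.0, 0.75]
print(labels_to_words(batch_seq, batch_lens, dict_int2lab))
print(pieces_to_words(["\u2581HE", "LLO", "\u2581MORN", "ING"]))
```

The padding index 0 in the second sequence is never looked up in ind2lab, because the relative length 0.75 trims the sequence to three labels first.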