speechbrain.dataio.encoder module
Encoding categorical data as integers
- Authors
Samuele Cornell 2020
Aku Rouhe 2020
Summary
Classes:
CTCTextEncoder – Subclass of TextEncoder which also provides methods to handle the CTC blank token.
CategoricalEncoder – Encode labels of a discrete set.
TextEncoder – CategoricalEncoder subclass which offers specific methods for encoding text and handles special tokens for training of sequence-to-sequence models.
Functions:
load_text_encoder_tokens – Loads the encoder tokens from a pretrained model.
Reference
- class speechbrain.dataio.encoder.CategoricalEncoder(starting_index=0, **special_labels)[source]
Bases:
object
Encode labels of a discrete set.
Used for encoding, e.g., speaker identities in speaker recognition. Given a collection of hashables (e.g. strings) it encodes every unique item to an integer value: ["spk0", "spk1"] -> [0, 1]. Internally, the correspondence between each label and its index is handled by two dictionaries: lab2ind and ind2lab.
The label integer encoding can be generated automatically from a SpeechBrain DynamicItemDataset by specifying the desired entry (e.g., spkid) in the annotation and calling the update_from_didataset method:
>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = {"ex_{}".format(x) : {"spkid" : "spk{}".format(x)} for x in range(20)}
>>> dataset = DynamicItemDataset(dataset)
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_didataset(dataset, "spkid")
>>> assert len(encoder) == len(dataset) # different speaker for each utterance
However, the encoder can also be updated from an iterable:
>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = ["spk{}".format(x) for x in range(20)]
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_iterable(dataset)
>>> assert len(encoder) == len(dataset)
Note
In both methods it can be specified whether the single element in the iterable or in the dataset should be treated as a sequence or not (default False). If it is a sequence, each element in the sequence will be encoded.
>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = [[x+1, x+2] for x in range(20)]
>>> encoder = CategoricalEncoder()
>>> encoder.ignore_len()
>>> encoder.update_from_iterable(dataset, sequence_input=True)
>>> assert len(encoder) == 21 # there are only 21 unique elements 1-21
This class offers 4 different methods to explicitly add a label to the internal dicts: add_label, ensure_label, insert_label and enforce_label. add_label and insert_label will raise an error if the label is already present in the internal dicts. insert_label and enforce_label also allow specifying the integer value to which the desired label is encoded.
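For instance, a minimal sketch of these four methods, following the behavior described above (the labels and indices here are arbitrary toy values):
>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> enc = CategoricalEncoder()
>>> enc.add_label("cat")         # added at the next free position
0
>>> enc.ensure_label("cat")      # already present: returns the existing index
0
>>> enc.insert_label("dog", 10)  # force a specific index
>>> enc.enforce_label("cat", 3)  # move "cat" to index 3
>>> enc.lab2ind["cat"]
3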
Encoding can be performed using 4 different methods: encode_label, encode_sequence, encode_label_torch and encode_sequence_torch. encode_label operates on single labels and simply returns the corresponding integer encoding:
>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> from speechbrain.dataio.dataset import DynamicItemDataset
>>> dataset = ["spk{}".format(x) for x in range(20)]
>>> encoder.update_from_iterable(dataset)
>>> encoder.encode_label("spk1")
22
encode_sequence operates on sequences of labels:
>>> encoder.encode_sequence(["spk1", "spk19"])
[22, 40]
encode_label_torch and encode_sequence_torch return torch tensors:
>>> encoder.encode_sequence_torch(["spk1", "spk19"])
tensor([22, 40])
Decoding can be performed using the decode_torch and decode_ndim methods:
>>> encoded = encoder.encode_sequence_torch(["spk1", "spk19"])
>>> encoder.decode_torch(encoded)
['spk1', 'spk19']
decode_ndim is used for multidimensional lists or pytorch tensors:
>>> encoded = encoded.unsqueeze(0).repeat(3, 1)
>>> encoder.decode_torch(encoded)
[['spk1', 'spk19'], ['spk1', 'spk19'], ['spk1', 'spk19']]
In some applications, a label that has not been encountered during training may appear at test time. To handle this out-of-vocabulary problem, add_unk can be used. Every out-of-vocab label is mapped to the special <unk> label and its corresponding integer encoding.
>>> import torch
>>> try:
...     encoder.encode_label("spk42")
... except KeyError:
...     print("spk42 is not in the encoder this raises an error!")
spk42 is not in the encoder this raises an error!
>>> encoder.add_unk()
41
>>> encoder.encode_label("spk42")  # returns the <unk> encoding
41
This class also offers methods to save and load the internal mappings between labels and indices: save and load, as well as load_or_create.
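For instance, a small sketch of a save/load round trip (the file lives in a pytest tmpdir, as in the load_if_possible example further below; the file name is arbitrary):
>>> mapping_file = getfixture('tmpdir') / "speaker_encoding.txt"
>>> enc = CategoricalEncoder()
>>> enc.update_from_iterable(["spk0", "spk1"])
>>> enc.save(mapping_file)
>>> restored = CategoricalEncoder()
>>> restored.load(mapping_file)
>>> restored.lab2ind == enc.lab2ind
True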
- VALUE_SEPARATOR = ' => '
- EXTRAS_SEPARATOR = '================\n'
- update_from_iterable(iterable, sequence_input=False)[source]
Update from iterator
- Parameters:
iterable (iterable) – Input sequence on which to operate.
sequence_input (bool) – Whether iterable yields sequences of labels or individual labels directly (default False).
- update_from_didataset(didataset, output_key, sequence_input=False)[source]
Update from DynamicItemDataset.
- Parameters:
didataset (DynamicItemDataset) – Dataset on which to operate.
output_key (str) – Key in the dataset (in data or a dynamic item) to encode.
sequence_input (bool) – Whether the data yielded with the specified key consists of sequences of labels or individual labels directly.
- limited_labelset_from_iterable(iterable, sequence_input=False, n_most_common=None, min_count=1)[source]
Produce label mapping from iterable based on label counts
Used to limit label set size.
- Parameters:
iterable (iterable) – Input sequence on which to operate.
sequence_input (bool) – Whether iterable yields sequences of labels or individual labels directly. False by default.
n_most_common (int, None) – Take at most this many labels as the label set, keeping the most common ones. If None (as by default), take all.
min_count (int) – Don't take labels if they appear less than this many times.
- Returns:
The counts of the different labels (unfiltered).
- Return type:
collections.Counter
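Example
A small sketch of pruning a label set by frequency, following the parameters above (the toy corpus is arbitrary):
>>> from speechbrain.dataio.encoder import CategoricalEncoder
>>> corpus = [["a", "b", "a"], ["a", "c"]]
>>> encoder = CategoricalEncoder()
>>> counts = encoder.limited_labelset_from_iterable(corpus, sequence_input=True, min_count=2)
>>> counts["a"], counts["b"]
(3, 1)
>>> sorted(encoder.lab2ind)
['a']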
- load_or_create(path, from_iterables=[], from_didatasets=[], sequence_input=False, output_key=None, special_labels={})[source]
Convenient syntax for creating the encoder conditionally
This pattern would be repeated in so many experiments that we decided to add a convenient shortcut for it here. The current version is multi-gpu (DDP) safe.
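Example
A sketch of the intended usage, assuming the encoding file may or may not exist yet (the file name is arbitrary; a pytest tmpdir is used as elsewhere on this page):
>>> path = getfixture('tmpdir') / "label_encoding.txt"
>>> encoder = CategoricalEncoder()
>>> _ = encoder.load_or_create(path, from_iterables=[["spk0", "spk1"]])  # any return value ignored
>>> encoder.encode_label("spk1")
1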
- add_label(label)[source]
Add new label to the encoder, at the next free position.
- Parameters:
label (hashable) – Most often labels are str, but anything that can act as dict key is supported. Note that default save/load only supports Python literals.
- Returns:
The index that was used to encode this label.
- Return type:
int
- ensure_label(label)[source]
Add a label if it is not already present.
- Parameters:
label (hashable) – Most often labels are str, but anything that can act as dict key is supported. Note that default save/load only supports Python literals.
- Returns:
The index that was used to encode this label.
- Return type:
int
- insert_label(label, index)[source]
Add a new label, forcing its index to a specific value.
If a label already has the specified index, it is moved to the end of the mapping.
- Parameters:
label (hashable) – Most often labels are str, but anything that can act as dict key is supported. Note that default save/load only supports Python literals.
index (int) – The specific index to use.
- enforce_label(label, index)[source]
Make sure label is present and encoded to a particular index.
If the label is present but encoded to some other index, it is moved to the given index.
If there is already another label at the given index, that label is moved to the next free position.
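Example
A sketch of the move-aside behavior described above (toy labels):
>>> enc = CategoricalEncoder()
>>> enc.update_from_iterable(["a", "b"])  # a -> 0, b -> 1
>>> enc.enforce_label("b", 0)             # "a" is moved out of the way
>>> enc.lab2ind["b"]
0
>>> "a" in enc.lab2ind                    # "a" is still present, at the next free index
True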
- add_unk(unk_label='<unk>')[source]
Add label for unknown tokens (out-of-vocab).
When asked to encode unknown labels, they can be mapped to this.
- Parameters:
unk_label (hashable, optional) – Most often labels are str, but anything that can act as dict key is supported. Note that default save/load only supports Python literals. Default: <unk>. This can be None, as well!
- Returns:
The index that was used to encode this.
- Return type:
int
- is_continuous()[source]
Check that the set of indices doesn't have gaps.
For example, if starting index = 1:
Continuous: [1,2,3,4]
Continuous: [0,1,2]
Non-continuous: [2,3,4]
Non-continuous: [1,2,4]
- Returns:
True if continuous.
- Return type:
bool
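Example
A small sketch with the default starting index of 0 (toy labels):
>>> enc = CategoricalEncoder()
>>> enc.insert_label("a", 0)
>>> enc.insert_label("b", 2)  # leaves index 1 unused
>>> enc.is_continuous()
False
>>> enc.insert_label("c", 1)  # fills the gap
>>> enc.is_continuous()
True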
- encode_label_torch(label, allow_unk=True)[source]
Encode label to torch.LongTensor.
- Parameters:
label (hashable) – Label to encode, must exist in the mapping.
allow_unk (bool) – If the given label is not in the label set AND unk_label has been added with add_unk(), allows encoding to unk_label's index.
- Returns:
Corresponding encoded int value. Tensor shape [1].
- Return type:
torch.LongTensor
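Example
A minimal sketch with toy labels (ignore_len() just silences the category-count warning, as in the class example above):
>>> enc = CategoricalEncoder()
>>> enc.update_from_iterable(["yes", "no"])
>>> enc.ignore_len()
>>> enc.encode_label_torch("no")
tensor([1])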
- encode_sequence_torch(sequence, allow_unk=True)[source]
Encode a sequence of labels to torch.LongTensor
- Parameters:
sequence (iterable) – Labels to encode, must exist in the mapping.
allow_unk (bool) – If a given label is not in the label set AND unk_label has been added with add_unk(), allows encoding to unk_label's index.
- Returns:
Corresponding integer labels. Tensor shape [len(sequence)].
- Return type:
torch.LongTensor
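Example
A minimal sketch of out-of-vocab handling with allow_unk (toy labels):
>>> enc = CategoricalEncoder()
>>> enc.update_from_iterable(["yes", "no"])
>>> enc.add_unk()
2
>>> enc.ignore_len()
>>> enc.encode_sequence_torch(["yes", "maybe"])  # "maybe" falls back to <unk>
tensor([0, 2])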
- decode_torch(x)[source]
Decodes an arbitrarily nested torch.Tensor to a list of labels.
Provided separately because Torch provides clearer introspection, and so doesn't require try-except.
- Parameters:
x (torch.Tensor) – Torch tensor of some integer dtype (Long, int) and any shape to decode.
- Returns:
list of original labels
- Return type:
list
- decode_ndim(x)[source]
Decodes an arbitrarily nested iterable to a list of labels.
This works for essentially any pythonic iterable (including torch), and also single elements.
- Parameters:
x (Any) – Python list or other iterable or torch.Tensor or a single integer element.
- Returns:
ndim list of original labels, or if the input was a single element, the output is a single label as well.
- Return type:
list, Any
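Example
A minimal sketch with a nested list and with a single element (toy labels):
>>> enc = CategoricalEncoder()
>>> enc.update_from_iterable(["a", "b", "c"])
>>> enc.ignore_len()
>>> enc.decode_ndim([[0, 1], [2]])
[['a', 'b'], ['c']]
>>> enc.decode_ndim(1)
'b'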
- save(path)[source]
Save the categorical encoding for later use and recovery
Saving uses a Python literal format, which supports things like tuple labels, but is considered safe to load (unlike e.g. pickle).
- Parameters:
path (str, Path) – Where to save. Will overwrite.
- load(path)[source]
Loads from the given path.
CategoricalEncoder uses a Python literal format, which supports things like tuple labels, but is considered safe to load (unlike e.g. pickle).
- Parameters:
path (str, Path) – Where to load from.
- load_if_possible(path, end_of_epoch=False)[source]
Loads if possible, returns a bool indicating if loaded or not.
- Parameters:
path (str, Path) – Where to load from.
end_of_epoch (bool) – Provided for compatibility with the checkpoint loading interface.
- Returns:
If load was successful.
- Return type:
bool
Example
>>> encoding_file = getfixture('tmpdir') / "encoding.txt"
>>> encoder = CategoricalEncoder()
>>> # The idea is in an experiment script to have something like this:
>>> if not encoder.load_if_possible(encoding_file):
...     encoder.update_from_iterable("abcd")
...     encoder.save(encoding_file)
>>> # So the first time you run the experiment, the encoding is created.
>>> # However, later, the encoding exists:
>>> encoder = CategoricalEncoder()
>>> encoder.expect_len(4)
>>> if not encoder.load_if_possible(encoding_file):
...     assert False # We won't get here!
>>> encoder.decode_ndim(range(4))
['a', 'b', 'c', 'd']
- expect_len(expected_len)[source]
Specify the expected category count. If the category count observed during encoding/decoding does NOT match this, an error will be raised.
This can prove useful to detect bugs in scenarios where the encoder is dynamically built using a dataset, but downstream code expects a specific category count (and may silently break otherwise).
This can be called anytime and the category count check will only be performed during an actual encoding/decoding task.
- Parameters:
expected_len (int) – The expected final category count, i.e. len(encoder).
Example
>>> encoder = CategoricalEncoder()
>>> encoder.update_from_iterable("abcd")
>>> encoder.expect_len(3)
>>> encoder.encode_label("a")
Traceback (most recent call last):
...
RuntimeError: .expect_len(3) was called, but 4 categories found
>>> encoder.expect_len(4)
>>> encoder.encode_label("a")
0
- ignore_len()[source]
Specifies that category count shall be ignored at encoding/decoding time.
Effectively inhibits the ".expect_len was never called" warning. Prefer expect_len() when the category count is known.
- class speechbrain.dataio.encoder.TextEncoder(starting_index=0, **special_labels)[source]
Bases:
CategoricalEncoder
CategoricalEncoder subclass which offers specific methods for encoding text and handles special tokens for training of sequence-to-sequence models. In detail, aside from the special <unk> token already present in CategoricalEncoder for handling out-of-vocab tokens, special methods to handle the <bos> beginning-of-sequence and <eos> end-of-sequence tokens are defined here.
Note: update_from_iterable and update_from_didataset here have sequence_input=True as the default, because it is assumed that this encoder is used on iterables of sequences of strings, e.g.:
>>> from speechbrain.dataio.encoder import TextEncoder
>>> dataset = [["encode", "this", "textencoder"], ["foo", "bar"]]
>>> encoder = TextEncoder()
>>> encoder.update_from_iterable(dataset)
>>> encoder.expect_len(5)
>>> encoder.encode_label("this")
1
>>> encoder.add_unk()
5
>>> encoder.expect_len(6)
>>> encoder.encode_sequence(["this", "out-of-vocab"])
[1, 5]
Two methods can be used to add <bos> and <eos> to the internal dicts: insert_bos_eos, add_bos_eos.
>>> encoder.add_bos_eos()
>>> encoder.expect_len(8)
>>> encoder.lab2ind[encoder.eos_label]
7
add_bos_eos adds the special tokens at the end of the dict indexes:
>>> encoder = TextEncoder()
>>> encoder.update_from_iterable(dataset)
>>> encoder.insert_bos_eos(bos_index=0, eos_index=1)
>>> encoder.expect_len(7)
>>> encoder.lab2ind[encoder.eos_label]
1
insert_bos_eos allows specifying which index will correspond to each of them. Note that you can also specify the same integer encoding for both.
Four methods can be used to prepend <bos> and append <eos>. prepend_bos_label and append_eos_label add respectively the <bos> and <eos> string tokens to the input sequence
>>> words = ["foo", "bar"]
>>> encoder.prepend_bos_label(words)
['<bos>', 'foo', 'bar']
>>> encoder.append_eos_label(words)
['foo', 'bar', '<eos>']
prepend_bos_index and append_eos_index add respectively the <bos> and <eos> indexes to the input encoded sequence.
>>> words = ["foo", "bar"]
>>> encoded = encoder.encode_sequence(words)
>>> encoder.prepend_bos_index(encoded)
[0, 3, 4]
>>> encoder.append_eos_index(encoded)
[3, 4, 1]
- update_from_iterable(iterable, sequence_input=True)[source]
Change default for sequence_input to True.
- update_from_didataset(didataset, output_key, sequence_input=True)[source]
Change default for sequence_input to True.
- limited_labelset_from_iterable(iterable, sequence_input=True, n_most_common=None, min_count=1)[source]
Change default for sequence_input to True.
- add_bos_eos(bos_label='<bos>', eos_label='<eos>')[source]
Add sentence boundary markers in the label set.
If the beginning-of-sentence and end-of-sentence markers are the same, will just use one sentence-boundary label.
This method adds to the end of the index, rather than at the beginning, like insert_bos_eos.
- Parameters:
bos_label (hashable) – Beginning-of-sentence label, any label.
eos_label (hashable) – End-of-sentence label, any label. If set to the same label as bos_label, will just use one sentence-boundary label.
- insert_bos_eos(bos_label='<bos>', eos_label='<eos>', bos_index=0, eos_index=None)[source]
Insert sentence boundary markers in the label set.
If the beginning-of-sentence and end-of-sentence markers are the same, will just use one sentence-boundary label.
- Parameters:
bos_label (hashable) – Beginning-of-sentence label, any label.
eos_label (hashable) – End-of-sentence label, any label. If set to the same label as bos_label, will just use one sentence-boundary label.
bos_index (int) – Where to insert bos_label.
eos_index (int, optional) – Where to insert eos_label. Default: bos_index + 1.
- class speechbrain.dataio.encoder.CTCTextEncoder(starting_index=0, **special_labels)[source]
Bases:
TextEncoder
Subclass of TextEncoder which also provides methods to handle CTC blank token.
add_blank and insert_blank can be used to add the <blank> special token to the encoder state.
>>> from speechbrain.dataio.encoder import CTCTextEncoder
>>> chars = ["a", "b", "c", "d"]
>>> encoder = CTCTextEncoder()
>>> encoder.update_from_iterable(chars)
>>> encoder.add_blank()
>>> encoder.expect_len(5)
>>> encoder.encode_sequence(chars)
[0, 1, 2, 3]
>>> encoder.get_blank_index()
4
>>> encoder.decode_ndim([0, 1, 2, 3, 4])
['a', 'b', 'c', 'd', '<blank>']
collapse_labels and collapse_indices_ndim can be used to apply CTC collapsing rules:
>>> encoder.collapse_labels(["a", "a", "b", "c", "d"])
['a', 'b', 'c', 'd']
>>> encoder.collapse_indices_ndim([4, 4, 0, 1, 2, 3, 4, 4]) # 4 is <blank>
[0, 1, 2, 3]
- collapse_labels(x, merge_repeats=True)[source]
Applies the CTC collapsing rules on one label sequence.
- Parameters:
x (iterable) – Label sequence on which to operate.
merge_repeats (bool) – Whether to merge repeated labels before removing blanks. In the basic CTC label topology, repeated labels are merged. However, in RNN-T, they are not.
- Returns:
List of labels with collapsing rules applied.
- Return type:
list
- collapse_indices_ndim(x, merge_repeats=True)[source]
Applies the CTC collapsing rules on an arbitrarily nested label sequence.
- Parameters:
x (iterable) – Label sequence on which to operate.
merge_repeats (bool) – Whether to merge repeated labels before removing blanks. In the basic CTC label topology, repeated labels are merged. However, in RNN-T, they are not.
- Returns:
List of labels with collapsing rules applied.
- Return type:
list
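Example
A small sketch contrasting the default CTC collapsing with merge_repeats=False (toy labels; index 2 is the blank here):
>>> from speechbrain.dataio.encoder import CTCTextEncoder
>>> encoder = CTCTextEncoder()
>>> encoder.update_from_iterable([["a", "b"]])
>>> encoder.add_blank()
>>> encoder.ignore_len()
>>> encoder.collapse_indices_ndim([2, 0, 0, 2, 1, 1, 2])
[0, 1]
>>> encoder.collapse_indices_ndim([2, 0, 0, 2, 1, 1, 2], merge_repeats=False)
[0, 0, 1, 1]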
- speechbrain.dataio.encoder.load_text_encoder_tokens(model_path)[source]
Loads the encoder tokens from a pretrained model.
This function is useful when working with a pretrained HF model. It loads the tokens in the YAML so that any CTCBaseSearcher can then be instantiated directly in the YAML file.