speechbrain.dataio.iterators module

Webdataset-compatible iterators

Authors:
  • Aku Rouhe 2021

Summary

Classes:

LengthItem

RatioIndex

Functions:

dynamic_bucketed_batch

Produce batches from a sorted buffer

indices_around_random_pivot

Random pivot sampler_fn for dynamic_bucketed_batch

padding_ratio

total_length_with_padding

Reference

class speechbrain.dataio.iterators.LengthItem(length: int, data: Any)

Bases: object

length: int
data: Any

speechbrain.dataio.iterators.total_length_with_padding(lengths)

speechbrain.dataio.iterators.padding_ratio(lengths)

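Neither helper carries a description on this page. Going by the definition this module gives for target_batch_numel (batch size * length of longest example), a minimal sketch of what the two functions plausibly compute might look as follows. The function bodies are assumptions for illustration, not the library's source. In particular, the padding ratio is assumed to be the fraction of the padded total that consists of padding.

```python
# Local stand-ins written for illustration; nothing is imported from speechbrain.


def total_length_with_padding(lengths):
    """Assumed: every example is padded to the longest one in the batch."""
    return len(lengths) * max(lengths)


def padding_ratio(lengths):
    """Assumed: fraction of the padded total that is padding."""
    return 1.0 - sum(lengths) / total_length_with_padding(lengths)


lengths = [4, 6, 10]
# 3 examples, each padded to the longest length (10).
print(total_length_with_padding(lengths))
# Share of the padded batch that holds padding rather than data.
print(padding_ratio(lengths))
```

Under these assumptions, a batch whose examples all have the same length has a padding ratio of zero, which is why batching from a sorted buffer tends to keep padding low.
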
class speechbrain.dataio.iterators.RatioIndex(ratio: float, index: int)

Bases: object

ratio: float
index: int
speechbrain.dataio.iterators.indices_around_random_pivot(databuffer, target_batch_numel, max_batch_size=None, max_batch_numel=None, max_padding_ratio=0.2, randint_generator=<bound method Random.randint of <random.Random object>>)

Random pivot sampler_fn for dynamic_bucketed_batch

Create a batch around a random pivot index in the sorted buffer

This operates on the databuffer, which is assumed to be in sorted order. A pivot index is chosen at random and forms the initial window of indices. The window is then grown one index at a time by adding either the index immediately to its left or the index immediately to its right, whichever increases the padding ratio the least. An index is only added if the resulting batch would exceed neither the maximum batch length nor the maximum padding ratio.

Parameters
  • databuffer (list) – Sorted list of LengthItems

  • target_batch_numel (int) – Target total batch length including padding, computed as batch size * length of the longest example. This function aims to return the batch as soon as the gathered length exceeds this target. If another limit is reached first, the target may not be satisfied.

  • max_batch_size (None, int) – Maximum number of examples to include in the batch, or None to not limit by number of examples.

  • max_batch_numel (None, int) – Maximum total batch length including padding, computed as batch size * length of the longest example.
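
The window-growing procedure described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: Item is a hypothetical stand-in for LengthItem, the helper functions and the rng argument are illustrative names, and the stopping rules follow the prose description only.

```python
import random
from dataclasses import dataclass
from typing import Any


@dataclass
class Item:
    """Hypothetical stand-in for LengthItem."""
    length: int
    data: Any = None


def padded_numel(lengths):
    # Assumed definition: batch size * length of longest example.
    return len(lengths) * max(lengths)


def pad_ratio(lengths):
    # Assumed definition: fraction of the padded total that is padding.
    return 1.0 - sum(lengths) / padded_numel(lengths)


def window_lengths(buffer, lo, hi):
    # Lengths of the items in the inclusive index window [lo, hi].
    return [item.length for item in buffer[lo:hi + 1]]


def window_around_pivot(buffer, target_numel, max_batch_size=None,
                        max_numel=None, max_padding_ratio=0.2, rng=random):
    """Return a contiguous list of indices into the sorted buffer."""
    lo = hi = rng.randint(0, len(buffer) - 1)  # window starts as the pivot only
    while True:
        lengths = window_lengths(buffer, lo, hi)
        if padded_numel(lengths) > target_numel:
            break  # gathered length exceeds the target
        if max_batch_size is not None and len(lengths) >= max_batch_size:
            break  # limit on the number of examples
        # Candidate windows: extend one step to the left or to the right.
        candidates = []
        if lo > 0:
            candidates.append((lo - 1, hi))
        if hi < len(buffer) - 1:
            candidates.append((lo, hi + 1))
        if not candidates:
            break  # the window already covers the whole buffer
        # Pick the extension that yields the lowest padding ratio.
        new_lo, new_hi = min(
            candidates, key=lambda w: pad_ratio(window_lengths(buffer, *w))
        )
        new_lengths = window_lengths(buffer, new_lo, new_hi)
        if pad_ratio(new_lengths) > max_padding_ratio:
            break  # growing further would add too much padding
        if max_numel is not None and padded_numel(new_lengths) > max_numel:
            break  # growing further would make the batch too large
        lo, hi = new_lo, new_hi
    return list(range(lo, hi + 1))


buffer = [Item(n) for n in sorted([3, 5, 5, 6, 8, 9, 12, 20, 21, 40])]
indices = window_around_pivot(
    buffer, target_numel=30, max_batch_size=4, rng=random.Random(0)
)
print(indices, [buffer[i].length for i in indices])
```

Because the buffer is sorted, neighbouring items have similar lengths, so a contiguous window keeps padding low. Passing a seeded random.Random makes the sketch reproducible. Note that the sketch does not check the pivot item itself against max_numel.
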

speechbrain.dataio.iterators.dynamic_bucketed_batch(data, len_key=None, len_fn=<built-in function len>, min_sample_len=None, max_sample_len=None, buffersize=1024, collate_fn=<class 'speechbrain.dataio.batch.PaddedBatch'>, sampler_fn=<function indices_around_random_pivot>, sampler_kwargs={}, drop_end=False)

Produce batches from a sorted buffer

This function keeps a sorted buffer of the incoming samples. The samples can be filtered for min/max length. An external sampler is used to choose samples for each batch, which allows different dynamic batching algorithms to be used.

Parameters
  • data (iterable) – An iterable source of samples, such as an IterableDataset.

  • len_key (str, None) – The key in the sample dict to use to fetch the length of the sample, or None if no key should be used.

  • len_fn (callable) – Called with sample[len_key] if len_key is not None, else sample. Needs to return the sample length as an integer.

  • min_sample_len (int, None) – Discard samples with length lower than this. If None, no minimum is applied.

  • max_sample_len (int, None) – Discard samples with length larger than this. If None, no maximum is applied.

  • buffersize (int) – The size of the internal sorted buffer. The buffer is always filled up before yielding a batch of samples.

  • collate_fn (callable) – Called with a list of samples. This should return a batch. Defaults to the SpeechBrain PaddedBatch class, which works for dict-like samples and pads any tensors.

  • sampler_fn (callable) – Called with the sorted data buffer. Needs to return a list of indices, which make up the next batch. Defaults to indices_around_random_pivot.

  • sampler_kwargs (dict) – Keyword arguments, passed to sampler_fn.

  • drop_end (bool) – Controls what happens once the data stream is exhausted. If False, batches are made until the data buffer is also exhausted. If True, the rest of the buffer is discarded. Without new samples, the last batches might not be efficient to process. Note: you can use .repeat on webdataset IterableDatasets to never run out of new samples, and then use speechbrain.dataio.dataloader.LoopedLoader to set a nominal epoch length.
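
The buffering behaviour described by these parameters can be illustrated with a small generator in plain Python. This is a sketch under stated assumptions rather than the library's code: bucketed_batches, take_batch and simple_sampler are hypothetical names, the sampler here receives only the sorted lengths, and collate_fn defaults to list instead of PaddedBatch so that the example runs without SpeechBrain installed.

```python
import bisect
import random


def simple_sampler(lengths, batch_size=3):
    """Hypothetical sampler: a contiguous run of indices at a random position."""
    start = random.randint(0, max(0, len(lengths) - batch_size))
    return list(range(start, min(start + batch_size, len(lengths))))


def take_batch(buffer, sampler_fn, collate_fn):
    # Ask the sampler for indices, then remove the chosen entries from the buffer.
    indices = sampler_fn([length for length, _, _ in buffer])
    batch = [buffer[i][2] for i in indices]
    for i in sorted(indices, reverse=True):
        del buffer[i]
    return collate_fn(batch)


def bucketed_batches(data, len_fn=len, min_len=None, max_len=None,
                     buffersize=8, collate_fn=list,
                     sampler_fn=simple_sampler, drop_end=False):
    buffer = []  # kept sorted; entries are (length, arrival order, sample)
    for order, sample in enumerate(data):
        length = len_fn(sample)
        if min_len is not None and length < min_len:
            continue  # too short: discard
        if max_len is not None and length > max_len:
            continue  # too long: discard
        # The unique arrival order breaks ties, so samples are never compared.
        bisect.insort(buffer, (length, order, sample))
        if len(buffer) >= buffersize:
            yield take_batch(buffer, sampler_fn, collate_fn)
    if not drop_end:
        # The stream has ended: keep batching until the buffer is empty.
        while buffer:
            yield take_batch(buffer, sampler_fn, collate_fn)


words = ["a" * n for n in [1, 7, 3, 12, 5, 4, 9, 2, 6, 8, 3]]
for batch in bucketed_batches(words, min_len=2, max_len=10, buffersize=4):
    print([len(w) for w in batch])
```

With drop_end=False every sample that passes the length filter ends up in exactly one batch. With drop_end=True, whatever remains in the buffer when the stream ends is dropped, which avoids the small or poorly matched batches that the final drain can produce.
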