speechbrain.utils.data_utils module¶

This module gathers utilities for data I/O operations.

Authors

Mirco Ravanelli 2020
Aku Rouhe 2020
Samuele Cornell 2020

Summary¶

Functions:

`batch_pad_right`	Given a list of torch tensors, batches them together by padding each dimension on the right so that all tensors have the same shape.
`download_file`	Downloads the file from the given source and saves it in the given destination path.
`get_all_files`	Returns a list of files found within a folder.
`mod_default_collate`	Makes a tensor from list of batch values.
`pad_right_to`	Takes a torch tensor of arbitrary shape and pads it to the target shape by appending values on the right.
`recursive_items`	Yield each (key, value) of a nested dictionary.
`recursive_to`	Moves data to device, or other type, and handles containers.
`recursive_update`	Similar function to dict.update, but for a nested dict.
`split_by_whitespace`	A very basic functional version of str.split
`split_list`	Returns a list of splits in the sequence.
`split_path`	Splits a path into source and filename.
`undo_padding`	Produces Python lists given a batch of sentences with their corresponding relative lengths.

Reference¶

speechbrain.utils.data_utils.undo_padding(batch, lengths)[source]¶

Produces Python lists given a batch of sentences with their corresponding relative lengths.

Parameters

batch (tensor) – Batch of padded sentences.
lengths (tensor) – Relative length of each sentence in the batch, as a fraction of the padded length.

Example

>>> batch=torch.rand([4,100])
>>> lengths=torch.tensor([0.5,0.6,0.7,1.0])
>>> snt_list=undo_padding(batch, lengths)
>>> len(snt_list)
4
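The conversion from relative to absolute lengths can be sketched in plain Python, with nested lists standing in for tensors. This is an illustration under stated assumptions, not the library code: `undo_padding_lists` is a hypothetical name, and rounding the product to the nearest integer is an assumption.

```python
def undo_padding_lists(batch, lengths):
    """Trim each padded row back to its true length.

    batch: list of equal-length rows (stand-in for a 2-D tensor).
    lengths: relative length of each row, as a fraction of the row size.
    """
    max_len = len(batch[0])
    trimmed = []
    for row, rel_len in zip(batch, lengths):
        # Assumption: the absolute length is the rounded product.
        abs_len = int(round(rel_len * max_len))
        trimmed.append(list(row[:abs_len]))
    return trimmed


batch = [[1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 8, 9, 10, 11, 12]]
sentences = undo_padding_lists(batch, [0.5, 1.0])
print(sentences)  # [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12]]
```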

speechbrain.utils.data_utils.get_all_files(dirName, match_and=None, match_or=None, exclude_and=None, exclude_or=None)[source]¶

Returns a list of files found within a folder.

Different options can be used to restrict the search to some specific patterns.

Parameters

dirName (str) – The directory to search.
match_and (list) – A list that contains patterns to match. The file is returned if it matches all the entries in match_and.
match_or (list) – A list that contains patterns to match. The file is returned if it matches one or more of the entries in match_or.
exclude_and (list) – A list that contains patterns to match. The file is returned if it matches none of the entries in exclude_and.
exclude_or (list) – A list that contains patterns to match. The file is returned if it fails to match one of the entries in exclude_or.

Example

>>> get_all_files('samples/rir_samples', match_and=['3.wav'])
['samples/rir_samples/rir3.wav']
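The two match filters can be sketched as a small predicate on a path. This is a simplified illustration rather than the library code: `passes_match_filters` is a hypothetical helper, patterns are assumed to be plain substrings of the path (consistent with the example above), and the exclude filters are left out.

```python
def passes_match_filters(path, match_and=None, match_or=None):
    """Return True if path satisfies both match filters (None = no filter)."""
    # match_and: every pattern must occur in the path.
    if match_and is not None and not all(p in path for p in match_and):
        return False
    # match_or: at least one pattern must occur in the path.
    if match_or is not None and not any(p in path for p in match_or):
        return False
    return True


paths = ["samples/rir1.wav", "samples/rir3.wav", "samples/notes3.txt"]
kept = [p for p in paths if passes_match_filters(p, match_and=["rir", "3"])]
print(kept)  # ['samples/rir3.wav']
```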

speechbrain.utils.data_utils.split_list(seq, num)[source]¶

Returns a list of splits in the sequence.

Parameters

seq (iterable) – The input list, to be split.
num (int) – The number of chunks to produce.

Example

>>> split_list([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)
[[1, 2], [3, 4], [5, 6], [7, 8, 9]]
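When the length is not divisible by num, the chunks are uneven, as the example shows. One way to obtain this split is sketched below; `split_into_chunks` is a hypothetical name, and the boundary rule (chunk i starts at floor(i * len / num)) is an assumption chosen because it reproduces the output above.

```python
def split_into_chunks(seq, num):
    """Split seq into num contiguous chunks of near-equal size."""
    n = len(seq)
    # Assumption: chunk i spans floor(i * n / num) to floor((i + 1) * n / num).
    bounds = [i * n // num for i in range(num + 1)]
    return [list(seq[start:end]) for start, end in zip(bounds, bounds[1:])]


chunks = split_into_chunks([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)
print(chunks)  # [[1, 2], [3, 4], [5, 6], [7, 8, 9]]
```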

speechbrain.utils.data_utils.recursive_items(dictionary)[source]¶

Yield each (key, value) of a nested dictionary.

Parameters: dictionary (dict) – The nested dictionary to list.
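Yields: (key, value) tuples for the leaf entries of the dictionary; as the example below shows, the keys of intermediate dictionaries are not yielded.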

Example

>>> rec_dict={'lev1': {'lev2': {'lev3': 'current_val'}}}
>>> [item for item in recursive_items(rec_dict)]
[('lev3', 'current_val')]
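The traversal can be sketched as a recursive generator that descends into nested dictionaries and yields only leaf entries. This is a minimal sketch consistent with the example above; `iter_leaf_items` is a hypothetical name, and treating only `dict` instances as nested levels is an assumption.

```python
def iter_leaf_items(dictionary):
    """Yield (key, value) pairs for every non-dict value, at any depth."""
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # Descend into the nested level; its own key is not yielded.
            yield from iter_leaf_items(value)
        else:
            yield key, value


config = {"model": {"layers": 4, "dropout": 0.1}, "seed": 1234}
leaves = list(iter_leaf_items(config))
print(leaves)  # [('layers', 4), ('dropout', 0.1), ('seed', 1234)]
```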

speechbrain.utils.data_utils.recursive_update(d, u, must_match=False)[source]¶

Similar function to dict.update, but for a nested dict.

From: https://stackoverflow.com/a/3233356

If you have a nested mapping structure, for example:

{"a": 1, "b": {"c": 2}}

Say you want to update the above structure with:

{"b": {"d": 3}}

This function will produce:

{"a": 1, "b": {"c": 2, "d": 3}}

Instead of:

{"a": 1, "b": {"d": 3}}

Parameters

d (dict) – Mapping to be updated.
u (dict) – Mapping to update with.
must_match (bool) – Whether to raise an error if a key in u does not exist in d.

Example

>>> d = {'a': 1, 'b': {'c': 2}}
>>> recursive_update(d, {'b': {'d': 3}})
>>> d
{'a': 1, 'b': {'c': 2, 'd': 3}}
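The merge behavior can be sketched as follows, in the spirit of the Stack Overflow answer cited above. This is an illustration, not the library code: `nested_update` is a hypothetical name, and raising `KeyError` when must_match is set is an assumption about the error type.

```python
from collections.abc import Mapping


def nested_update(d, u, must_match=False):
    """Update mapping d in place with u, merging nested mappings."""
    for key, value in u.items():
        if isinstance(value, Mapping) and isinstance(d.get(key), Mapping):
            # Both sides are mappings: merge instead of overwriting.
            nested_update(d[key], value, must_match=must_match)
        elif must_match and key not in d:
            # Assumption: a KeyError is the error raised on a mismatch.
            raise KeyError(f"Key {key!r} is not present in the target mapping")
        else:
            d[key] = value


d = {"a": 1, "b": {"c": 2}}
nested_update(d, {"b": {"d": 3}, "a": 10})
print(d)  # {'a': 10, 'b': {'c': 2, 'd': 3}}
```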

speechbrain.utils.data_utils.download_file(source, dest, unpack=False, dest_unpack=None, replace_existing=False)[source]¶

Downloads the file from the given source and saves it in the given destination path.

Parameters

source (path or url) – Path of the source file. If the source is a URL, the file is downloaded from the web.
dest (path) – Destination path.
unpack (bool) – If True, unpacks the data in the dest folder.
replace_existing (bool) – If True, replaces existing files.

speechbrain.utils.data_utils.pad_right_to(tensor: torch.Tensor, target_shape: (list, tuple), mode='constant', value=0)[source]¶

Takes a torch tensor of arbitrary shape and pads it to the target shape by appending values on the right.

Parameters

tensor (torch.Tensor) – Input tensor whose dimensions need padding.
target_shape ((list, tuple)) – Target shape for the padded tensor; its length must equal tensor.ndim.
mode (str) – Pad mode; refer to the torch.nn.functional.pad documentation.
value (float) – Pad value; refer to the torch.nn.functional.pad documentation.

Returns

tensor (torch.Tensor) – Padded tensor.
valid_vals (list) – List containing, for each dimension, the proportion of original (non-padded) values.
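The bookkeeping behind right-padding can be sketched without PyTorch by working on shapes alone. The hypothetical helper `right_pad_spec` below computes the flat padding list in the layout that torch.nn.functional.pad expects ((left, right) pairs, last dimension first) together with the per-dimension proportions. Returning the proportions in dimension order is an assumption.

```python
def right_pad_spec(shape, target_shape):
    """Compute right-padding amounts and valid proportions per dimension.

    Returns (pads, valid_vals). pads follows the flat layout used by
    torch.nn.functional.pad: (left, right) pairs, last dimension first.
    """
    if len(shape) != len(target_shape):
        raise ValueError("target_shape must have one entry per dimension")
    pads = []
    for size, target in zip(reversed(shape), reversed(target_shape)):
        if target < size:
            raise ValueError("target_shape must not be smaller than shape")
        pads.extend([0, target - size])  # pad only on the right
    # Assumption: proportions are listed in dimension order (first dim first).
    valid_vals = [size / target for size, target in zip(shape, target_shape)]
    return pads, valid_vals


pads, valid_vals = right_pad_spec((3, 5), (4, 8))
print(pads)        # [0, 3, 0, 1]
print(valid_vals)  # [0.75, 0.625]
```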

speechbrain.utils.data_utils.batch_pad_right(tensors: list, mode='constant', value=0)[source]¶

Given a list of torch tensors, batches them together by padding each dimension on the right so that all tensors have the same shape.

Parameters

tensors (list) – List of tensors to pad together.
mode (str) – Padding mode; see the torch.nn.functional.pad documentation.
value (float) – Padding value; see the torch.nn.functional.pad documentation.

Returns

tensor (torch.Tensor) – Padded tensor.
valid_vals (list) – List containing, for each dimension, the proportion of original (non-padded) values.
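For intuition, the one-dimensional case can be sketched with plain lists: every sequence is padded on the right to the longest length, and the proportion of original values is recorded. `pad_batch_1d` is a hypothetical helper restricted to 1-D sequences, so it returns a single proportion per sequence; it is not the library implementation.

```python
def pad_batch_1d(sequences, value=0):
    """Right-pad 1-D sequences to a common length.

    Returns (padded, rel_lengths), where rel_lengths[i] is the fraction
    of row i that holds original (non-padded) values.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]
    rel_lengths = [len(seq) / max_len for seq in sequences]
    return padded, rel_lengths


padded, rel_lengths = pad_batch_1d([[1, 2, 3], [4], [5, 6]])
print(padded)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
print(rel_lengths)
```

Relative lengths of this kind are what undo_padding takes as its lengths argument.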

speechbrain.utils.data_utils.split_by_whitespace(text)[source]¶

A very basic functional version of str.split.
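A plain function is convenient wherever a callable is expected, for example as an argument to map. The sketch below assumes the behavior is that of str.split with no arguments; `split_on_whitespace` is a hypothetical stand-in.

```python
def split_on_whitespace(text):
    """Split on runs of whitespace, dropping leading and trailing blanks."""
    return text.split()


lines = ["hello  world", " a\tb\nc "]
tokens = list(map(split_on_whitespace, lines))
print(tokens)  # [['hello', 'world'], ['a', 'b', 'c']]
```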

speechbrain.utils.data_utils.recursive_to(data, *args, **kwargs)[source]¶

Moves data to device, or other type, and handles containers.

Very similar to torch.utils.data._utils.pin_memory.pin_memory, but applies .to() instead.
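The container handling can be sketched as a recursive function that applies .to() to anything that has it and rebuilds mappings and sequences around the results. This is a simplified illustration, not the library code: `move_nested` is a hypothetical name, `FakeTensor` is a stand-in for torch.Tensor so that the example runs without PyTorch, and converting tuples to lists is an assumption.

```python
from collections.abc import Mapping


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


def move_nested(data, *args, **kwargs):
    """Apply .to(*args, **kwargs) to every movable object inside data."""
    if hasattr(data, "to"):
        return data.to(*args, **kwargs)
    if isinstance(data, Mapping):
        return {k: move_nested(v, *args, **kwargs) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        # Assumption: sequences are rebuilt as lists.
        return [move_nested(v, *args, **kwargs) for v in data]
    return data  # strings, numbers, etc. are returned unchanged


batch = {"wav": FakeTensor(), "ids": ["utt1", "utt2"], "pair": (FakeTensor(), 3)}
moved = move_nested(batch, "cuda:0")
print(moved["wav"].device, moved["pair"][0].device, moved["ids"])
# cuda:0 cuda:0 ['utt1', 'utt2']
```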

speechbrain.utils.data_utils.mod_default_collate(batch)[source]¶

Makes a tensor from list of batch values.

Note that this doesn’t need to zip(*) values together as PaddedBatch connects them already (by key).

The aim here is to avoid raising an error.

This is modified from: https://github.com/pytorch/pytorch/blob/c0deb231db76dbea8a9d326401417f7d1ce96ed5/torch/utils/data/_utils/collate.py#L42

speechbrain.utils.data_utils.split_path(path)[source]¶

Splits a path into source and filename.

This also handles URLs and Huggingface hub paths, in addition to regular paths.

Parameters

path (str) – The path, URL, or Huggingface hub path to split.

Returns

str – Source
str – Filename
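Because regular paths, URLs, and hub-style paths all use "/" as a separator, one way to handle them uniformly is to split at the last "/". The sketch below makes that assumption; `split_source_and_filename` is a hypothetical name, the example paths are made up, and returning "./" as the source for a bare filename is also an assumption.

```python
def split_source_and_filename(path):
    """Split at the last "/" into (source, filename)."""
    if "/" in path:
        source, filename = path.rsplit("/", maxsplit=1)
        return source, filename
    # Assumption: a bare filename refers to the current directory.
    return "./", path


examples = [
    "/data/models/asr.ckpt",                # regular path
    "https://example.com/models/asr.ckpt",  # URL
    "some-org/some-model/asr.ckpt",         # hub-style path
    "asr.ckpt",                             # bare filename
]
results = [split_source_and_filename(p) for p in examples]
for source, filename in results:
    print(source, filename)
```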