speechbrain.utils.data_utils module
This library gathers utilities for data I/O operations.
- Authors
Mirco Ravanelli 2020
Aku Rouhe 2020
Samuele Cornell 2020
Summary
Functions:
- batch_pad_right: Given a list of torch tensors, batches them together by padding to the right on each dimension so that all have the same length.
- download_file: Downloads the file from the given source and saves it in the given destination path.
- get_all_files: Returns a list of files found within a folder.
- get_list_from_csv: Gets a list from the selected field of the input csv file.
- mod_default_collate: Makes a tensor from a list of batch values.
- pad_right_to: Takes a torch tensor of arbitrary shape and pads it to a target shape by appending values on the right.
- recursive_items: Yields each (key, value) of a nested dictionary.
- recursive_to: Moves data to a device, or other type, and handles containers.
- recursive_update: Similar to dict.update, but for a nested dict.
- scalarize: Converts a namedtuple or dictionary containing tensors to their scalar values.
- split_by_whitespace: A very basic functional version of str.split.
- split_list: Returns a list of splits in the sequence.
- split_path: Splits a path to source and filename.
- undo_padding: Produces Python lists given a batch of sentences with their corresponding relative lengths.
Reference
- speechbrain.utils.data_utils.undo_padding(batch, lengths)[source]
Produces Python lists given a batch of sentences with their corresponding relative lengths.
- Parameters
batch (tensor) – Batch of sentences gathered in a batch.
lengths (tensor) – Relative length of each sentence in the batch.
Example
>>> batch = torch.rand([4, 100])
>>> lengths = torch.tensor([0.5, 0.6, 0.7, 1.0])
>>> snt_list = undo_padding(batch, lengths)
>>> len(snt_list)
4
- speechbrain.utils.data_utils.get_all_files(dirName, match_and=None, match_or=None, exclude_and=None, exclude_or=None)[source]
Returns a list of files found within a folder.
Different options can be used to restrict the search to some specific patterns.
- Parameters
dirName (str) – The directory to search.
match_and (list) – A list that contains patterns to match. The file is returned if it matches all the entries in match_and.
match_or (list) – A list that contains patterns to match. The file is returned if it matches one or more of the entries in match_or.
exclude_and (list) – A list that contains patterns to match. The file is excluded only if it matches all the entries in exclude_and.
exclude_or (list) – A list that contains patterns to match. The file is excluded if it matches any of the entries in exclude_or.
Example
>>> get_all_files('tests/samples/RIRs', match_and=['3.wav'])
['tests/samples/RIRs/rir3.wav']
- speechbrain.utils.data_utils.get_list_from_csv(csvfile, field, delimiter=',', skipinitialspace=True)[source]
Gets a list from the selected field of the input csv file.
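The documented behavior can be sketched with the standard-library csv module. The helper name and the use of a file-like object here are illustrative assumptions; the real function takes a path to a CSV file with a header row:

```python
import csv
import io

def get_list_from_csv_sketch(csvfile, field, delimiter=",", skipinitialspace=True):
    # Read the header row with DictReader and collect the chosen column.
    reader = csv.DictReader(
        csvfile, delimiter=delimiter, skipinitialspace=skipinitialspace
    )
    return [row[field] for row in reader]

# Example with an in-memory CSV
data = io.StringIO("ID, wav\nutt1, a.wav\nutt2, b.wav")
print(get_list_from_csv_sketch(data, "wav"))  # ['a.wav', 'b.wav']
```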
- speechbrain.utils.data_utils.split_list(seq, num)[source]
Returns a list of splits in the sequence.
- Parameters
seq (iterable) – The input list, to be split.
num (int) – The number of chunks to produce.
Example
>>> split_list([1, 2, 3, 4, 5, 6, 7, 8, 9], 4)
[[1, 2], [3, 4], [5, 6], [7, 8, 9]]
- speechbrain.utils.data_utils.recursive_items(dictionary)[source]
Yield each (key, value) of a nested dictionary.
- Parameters
dictionary (dict) – The nested dictionary to list.
- Yields
(key, value) tuples from the dictionary.
Example
>>> rec_dict = {'lev1': {'lev2': {'lev3': 'current_val'}}}
>>> [item for item in recursive_items(rec_dict)]
[('lev3', 'current_val')]
- speechbrain.utils.data_utils.recursive_update(d, u, must_match=False)[source]
Similar function to dict.update, but for a nested dict.
From: https://stackoverflow.com/a/3233356
If you have a nested mapping structure, for example:
{"a": 1, "b": {"c": 2}}
and you want to update it with:
{"b": {"d": 3}}
this function will produce:
{"a": 1, "b": {"c": 2, "d": 3}}
instead of:
{"a": 1, "b": {"d": 3}}
- Parameters
d (dict) – Mapping to be updated.
u (dict) – Mapping to update with.
must_match (bool) – Whether each key in u must already exist in d.
Example
>>> d = {'a': 1, 'b': {'c': 2}}
>>> recursive_update(d, {'b': {'d': 3}})
>>> d
{'a': 1, 'b': {'c': 2, 'd': 3}}
- speechbrain.utils.data_utils.download_file(source, dest, unpack=False, dest_unpack=None, replace_existing=False)[source]
Downloads the file from the given source and saves it in the given destination path.
- Parameters
source (path or url) – Path or URL of the file to download.
dest (path) – Destination path.
unpack (bool) – If True, it unpacks the data in the dest folder.
dest_unpack (path) – Path where to store the unpacked data.
replace_existing (bool) – If True, replaces the existing files.
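The core logic can be sketched as follows. The helper name is hypothetical, and the real function also handles unpacking archives; this sketch only shows the fetch-or-copy and replace_existing behavior:

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_file_sketch(source, dest, replace_existing=False):
    # Hypothetical minimal version: fetch a URL, or copy a local file,
    # into dest, creating parent directories as needed.
    dest = Path(dest)
    if dest.exists() and not replace_existing:
        return dest  # keep the existing file
    dest.parent.mkdir(parents=True, exist_ok=True)
    if str(source).startswith(("http://", "https://")):
        urllib.request.urlretrieve(source, dest)
    else:
        shutil.copy(str(source), str(dest))
    return dest

# Example: copying a local file (runs offline)
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "a.txt")
Path(src).write_text("hello")
out = download_file_sketch(src, os.path.join(tmp, "sub", "b.txt"))
print(out.read_text())  # hello
```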
- speechbrain.utils.data_utils.pad_right_to(tensor: torch.Tensor, target_shape: (list, tuple), mode='constant', value=0)[source]
This function takes a torch tensor of arbitrary shape and pads it to target shape by appending values on the right.
- Parameters
tensor (torch.Tensor) – Input tensor whose dimensions we need to pad.
target_shape ((list, tuple)) – Target shape we want for the target tensor; its length must be equal to tensor.ndim.
mode (str) – Pad mode, please refer to torch.nn.functional.pad documentation.
value (float) – Pad value, please refer to torch.nn.functional.pad documentation.
- Returns
tensor (torch.Tensor) – Padded tensor.
valid_vals (list) – List containing the proportion of valid (non-padded) values for each dimension.
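The padding specification passed to torch.nn.functional.pad can be sketched in plain Python. The helper name is hypothetical; the key detail is that F.pad expects pad amounts for the last dimension first:

```python
def right_pad_spec(shape, target_shape):
    # Build the `pads` argument for torch.nn.functional.pad and the
    # per-dimension proportion of valid (non-padded) values.
    assert len(shape) == len(target_shape)
    pads = []
    valid = []
    for size, target in zip(shape, target_shape):
        # F.pad lists the last dimension first, so prepend each (left, right) pair.
        pads = [0, target - size] + pads
        valid.append(size / target)
    return pads, valid

# Padding a (2, 3) tensor to (2, 5): pad 2 on the right of the last dim
print(right_pad_spec((2, 3), (2, 5)))  # ([0, 2, 0, 0], [1.0, 0.6])
```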
- speechbrain.utils.data_utils.batch_pad_right(tensors: list, mode='constant', value=0)[source]
Given a list of torch tensors, it batches them together by padding to the right on each dimension so that all have the same length.
- Parameters
tensors (list) – The list of tensors to pad and batch together.
mode (str) – Pad mode, please refer to torch.nn.functional.pad documentation.
value (float) – Pad value, please refer to torch.nn.functional.pad documentation.
- Returns
tensor (torch.Tensor) – Padded tensor.
valid_vals (list) – List containing the proportion of valid (non-padded) values for each dimension.
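For 1-D sequences, the idea can be sketched without torch. The helper below is a hypothetical illustration operating on plain Python lists:

```python
def batch_pad_right_lists(seqs, value=0):
    # Pad every sequence on the right to the longest length in the batch,
    # and report the valid (non-padded) proportion of each sequence.
    max_len = max(len(s) for s in seqs)
    batch = [list(s) + [value] * (max_len - len(s)) for s in seqs]
    valid = [len(s) / max_len for s in seqs]
    return batch, valid

print(batch_pad_right_lists([[1, 2], [3, 4, 5, 6]]))
# ([[1, 2, 0, 0], [3, 4, 5, 6]], [0.5, 1.0])
```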
- speechbrain.utils.data_utils.split_by_whitespace(text)[source]
A very basic functional version of str.split
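Presumably this is equivalent to calling str.split with no arguments, which collapses runs of whitespace and drops leading and trailing whitespace:

```python
def split_by_whitespace(text):
    # str.split() with no separator splits on any run of whitespace.
    return text.split()

print(split_by_whitespace("  hello\tworld \n"))  # ['hello', 'world']
```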
- speechbrain.utils.data_utils.recursive_to(data, *args, **kwargs)[source]
Moves data to device, or other type, and handles containers.
Very similar to torch.utils.data._utils.pin_memory.pin_memory, but applies .to() instead.
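The container recursion can be sketched generically. The helper is hypothetical; the real function calls .to(*args, **kwargs) on each tensor leaf instead of an arbitrary fn:

```python
def recursive_apply(data, fn):
    # Walk dicts, lists and tuples, applying fn to every leaf value.
    # (Namedtuples would need type(data)(*values) instead of the generator.)
    if isinstance(data, dict):
        return {k: recursive_apply(v, fn) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(recursive_apply(v, fn) for v in data)
    return fn(data)

nested = {"a": [1, 2], "b": {"c": (3, 4)}}
print(recursive_apply(nested, lambda x: x * 10))
# {'a': [10, 20], 'b': {'c': (30, 40)}}
```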
- speechbrain.utils.data_utils.mod_default_collate(batch)[source]
Makes a tensor from list of batch values.
Note that this doesn’t need to zip(*) values together as PaddedBatch connects them already (by key).
Here the idea is not to error out.
This is modified from: https://github.com/pytorch/pytorch/blob/c0deb231db76dbea8a9d326401417f7d1ce96ed5/torch/utils/data/_utils/collate.py#L42
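The "don't error out" design can be sketched as a guarded collate. This is a hypothetical simplification; the real function special-cases tensors, numpy arrays, floats and ints rather than taking a collate callable:

```python
def safe_collate(values, collate):
    # Try to combine the values into one batch object; if the values are
    # mixed or ragged and collation fails, return the raw list unchanged.
    try:
        return collate(values)
    except (TypeError, ValueError, RuntimeError):
        return values

print(safe_collate([1, 2, 3], sum))  # 6
print(safe_collate([1, "a"], sum))   # [1, 'a']  (sum raises TypeError)
```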