speechbrain.utils.data_pipeline module¶
A pipeline for data transformations.
Example
>>> from hyperpyyaml import load_hyperpyyaml
>>> yamlstring = '''
... pipeline: !new:speechbrain.utils.data_pipeline.DataPipeline
...     static_data_keys: [a, b]
...     dynamic_items:
...         - func: !name:operator.add
...           takes: ["a", "b"]
...           provides: foo
...         - func: !name:operator.sub
...           takes: ["foo", "b"]
...           provides: bar
...     output_keys: ["foo", "bar"]
... '''
>>> hparams = load_hyperpyyaml(yamlstring)
>>> hparams["pipeline"]({"a":1, "b":2})
{'foo': 3, 'bar': 1}
- Author:
Aku Rouhe
Summary¶
Classes:
- DataPipeline: Organises data transformations into a pipeline.
- DynamicItem: Essentially represents a data transformation function.
- GeneratorDynamicItem: Essentially represents a multi-step data transformation.
- StaticItem: Data class that represents a static item.
Functions:
- provides: Decorator which makes a DynamicItem and specifies what keys it provides.
- provides_decorator: Decorator which makes a DynamicItem and specifies what keys it provides.
- takes: Decorator which makes a DynamicItem and specifies its argkeys.
- takes_decorator: Decorator which makes a DynamicItem and specifies its argkeys.
Reference¶
- class speechbrain.utils.data_pipeline.StaticItem(key: str)[source]¶
Bases:
object
Data class that represents a static item.
Static items are in-memory items so they don’t need to be computed dynamically.
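Since a static item only names a value that is already present in the data, its role can be sketched in a few lines (a conceptual illustration with a hypothetical `StaticItemSketch`, not the actual speechbrain class):

```python
from dataclasses import dataclass

# Conceptual sketch: a static item only records the key under which an
# in-memory value already exists in the data dict given to the pipeline.
@dataclass
class StaticItemSketch:
    key: str

item = StaticItemSketch(key="text")
data = {"text": "hello"}
value = data[item.key]  # looked up directly, never computed
```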
- class speechbrain.utils.data_pipeline.DynamicItem(takes=[], func=None, provides=[])[source]¶
Bases:
object
Essentially represents a data transformation function.
A DynamicItem takes some arguments and computes its value dynamically when called. A straightforward use-case is to load something from disk dynamically; take the path and provide the loaded data.
Instances of this class are often created implicitly via the @takes and @provides decorators or otherwise from specifying the taken and provided arguments and the function.
A counterpart is the GeneratorDynamicItem, which should be used for generator functions.
- Parameters
    - takes (list) – The keys to use as this item's inputs, resolved from the data or from other dynamic items.
    - func (callable) – The function that computes this item's value, called with the taken values as positional arguments.
    - provides (list) – The keys that this item provides.
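To make the takes/func/provides roles concrete, here is a minimal sketch of a DynamicItem-like wrapper (a hypothetical `DynamicItemSketch`, not speechbrain's actual implementation):

```python
import operator

# Hedged sketch: a DynamicItem-like wrapper stores the keys it takes,
# the function to apply, and the keys it provides. The pipeline resolves
# the taken keys to values and passes them positionally, in the order
# listed in `takes`.
class DynamicItemSketch:
    def __init__(self, takes=None, func=None, provides=None):
        self.takes = takes or []
        self.func = func
        self.provides = provides or []

    def __call__(self, *args):
        return self.func(*args)

item = DynamicItemSketch(takes=["a", "b"], func=operator.add, provides=["foo"])
result = item(1, 2)  # 1 + 2
```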
- class speechbrain.utils.data_pipeline.GeneratorDynamicItem(*args, **kwargs)[source]¶
Bases:
speechbrain.utils.data_pipeline.DynamicItem
Essentially represents a multi-step data transformation.
This is the generator function counterpart for DynamicItem (which should be used for regular functions).
A GeneratorDynamicItem first takes some arguments and then uses those in multiple steps to incrementally compute some values when called.
A typical use-case is a pipeline of transformations on data: e.g. taking in text as a string, first providing a tokenized version, and then on the second call providing an integer-encoded version. This can be used even though the integer-encoder needs to be trained on the first outputs.
The main benefit is to be able to define the pipeline in a clear function, even if parts of the pipeline depend on others for their initialization.
Example
>>> lab2ind = {}
>>> def text_pipeline(text):
...     text = text.lower().strip()
...     text = "".join(c for c in text if c.isalpha() or c == " ")
...     words = text.split()
...     yield words
...     encoded = [lab2ind[word] for word in words]
...     yield encoded
>>> item = GeneratorDynamicItem(
...     func=text_pipeline,
...     takes=["text"],
...     provides=["words", "words_encoded"])
>>> # First create the integer-encoding:
>>> ind = 1
>>> for token in item("Is this it? - This is it."):
...     if token not in lab2ind:
...         lab2ind[token] = ind
...         ind += 1
>>> # Now the integers can be encoded!
>>> item()
[1, 2, 3, 2, 1, 3]
- speechbrain.utils.data_pipeline.takes(*argkeys)[source]¶
Decorator which makes a DynamicItem and specifies its argkeys.
If the wrapped object is a generator function (has a yield statement), creates a GeneratorDynamicItem. If the object is already a DynamicItem, just specifies the argkeys for that item. Otherwise creates a new regular DynamicItem with the argkeys specified.
The args are always passed to the function at the start. Generators could support sending new arguments, but for such use cases, simply create a new dynamic item. The GeneratorDynamicItem class is meant for pipelines which take in an input and transform it in multiple ways, where the intermediate representations may be needed for e.g. fitting a BPE segmenter.
Example
>>> @takes("text")
... def tokenize(text):
...     return text.strip().lower().split()
>>> tokenize.provides = ["tokenized"]
>>> tokenize(' This Example gets tokenized')
['this', 'example', 'gets', 'tokenized']
- speechbrain.utils.data_pipeline.takes_decorator(*argkeys)¶
Decorator which makes a DynamicItem and specifies its argkeys.
If the wrapped object is a generator function (has a yield statement), creates a GeneratorDynamicItem. If the object is already a DynamicItem, just specifies the argkeys for that item. Otherwise creates a new regular DynamicItem with the argkeys specified.
The args are always passed to the function at the start. Generators could support sending new arguments, but for such use cases, simply create a new dynamic item. The GeneratorDynamicItem class is meant for pipelines which take in an input and transform it in multiple ways, where the intermediate representations may be needed for e.g. fitting a BPE segmenter.
Example
>>> @takes("text")
... def tokenize(text):
...     return text.strip().lower().split()
>>> tokenize.provides = ["tokenized"]
>>> tokenize(' This Example gets tokenized')
['this', 'example', 'gets', 'tokenized']
- speechbrain.utils.data_pipeline.provides(*output_keys)[source]¶
Decorator which makes a DynamicItem and specifies what keys it provides.
If the wrapped object is a generator function (has a yield statement), creates a GeneratorDynamicItem. If the object is already a DynamicItem, just specifies the provided keys for that item. Otherwise creates a new regular DynamicItem with the provided keys specified.
Note
The behavior differs slightly between generators and regular functions when multiple output keys are specified, e.g. @provides("signal", "mfcc"). Regular functions should return a tuple with len equal to len(output_keys), while generators should yield the items one by one.
>>> @provides("signal", "feat")
... def read_feat():
...     wav = [.1,.2,-.1]
...     feat = [s**2 for s in wav]
...     return wav, feat
>>> @provides("signal", "feat")
... def read_feat():
...     wav = [.1,.2,-.1]
...     yield wav
...     feat = [s**2 for s in wav]
...     yield feat
If multiple keys are yielded at once, write e.g.,
>>> @provides("wav_read", ["left_channel", "right_channel"])
... def read_multi_channel():
...     wav = [[.1,.2,-.1],[.2,.1,-.1]]
...     yield wav
...     yield wav[0], wav[1]
- speechbrain.utils.data_pipeline.provides_decorator(*output_keys)¶
Decorator which makes a DynamicItem and specifies what keys it provides.
If the wrapped object is a generator function (has a yield statement), creates a GeneratorDynamicItem. If the object is already a DynamicItem, just specifies the provided keys for that item. Otherwise creates a new regular DynamicItem with the provided keys specified.
Note
The behavior differs slightly between generators and regular functions when multiple output keys are specified, e.g. @provides("signal", "mfcc"). Regular functions should return a tuple with len equal to len(output_keys), while generators should yield the items one by one.
>>> @provides("signal", "feat")
... def read_feat():
...     wav = [.1,.2,-.1]
...     feat = [s**2 for s in wav]
...     return wav, feat
>>> @provides("signal", "feat")
... def read_feat():
...     wav = [.1,.2,-.1]
...     yield wav
...     feat = [s**2 for s in wav]
...     yield feat
If multiple keys are yielded at once, write e.g.,
>>> @provides("wav_read", ["left_channel", "right_channel"])
... def read_multi_channel():
...     wav = [[.1,.2,-.1],[.2,.1,-.1]]
...     yield wav
...     yield wav[0], wav[1]
- class speechbrain.utils.data_pipeline.DataPipeline(static_data_keys, dynamic_items=[], output_keys=[])[source]¶
Bases:
object
Organises data transformations into a pipeline.
Example
>>> pipeline = DataPipeline(
...     static_data_keys=["text"],
...     dynamic_items=[
...         {"func": lambda x: x.lower(), "takes": "text", "provides": "foo"},
...         {"func": lambda x: x[::-1], "takes": "foo", "provides": "bar"},
...     ],
...     output_keys=["bar"],
... )
>>> pipeline({"text": "Test"})
{'bar': 'tset'}
- add_static_keys(static_keys)[source]¶
Informs the pipeline about static items.
Static items are the ones provided to __call__ as data.
- add_dynamic_item(func, takes=None, provides=None)[source]¶
Adds a dynamic item to the Pipeline.
There are two calling conventions. For DynamicItem objects, use: add_dynamic_item(dynamic_item). Otherwise, use: add_dynamic_item(func, takes, provides).
- Parameters
func (callable, DynamicItem) – If a DynamicItem is given, adds that directly. Otherwise a DynamicItem is created, and this specifies the callable to use. If a generator function is given, creates a GeneratorDynamicItem; otherwise creates a normal DynamicItem.
takes (list, str) – List of keys. When func is called, each key is resolved to either an entry in the data or the output of another dynamic_item. The func is then called with these as positional arguments, in the same order as specified here. A single key can be given as a bare string.
provides (str, list) – For regular functions, the key or list of keys that it provides. If you give a generator function, key or list of keys that it yields, in order. Also see the provides decorator. A single key can be given as a bare string.
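The resolution rule described for takes can be sketched as a small recursive lookup: each key is either found in the static data or computed from another item's output. This is a conceptual illustration with a hypothetical item registry, not speechbrain's actual internals:

```python
import operator

# Hedged sketch of the documented resolution rule: a key is either a
# static entry in the data dict, or the output of another dynamic item,
# computed recursively and cached per call.
def compute(key, data, dynamic_items, cache):
    if key in data:          # static item: just look it up
        return data[key]
    if key not in cache:     # dynamic item: resolve its takes, then call
        func, takes = dynamic_items[key]
        args = [compute(k, data, dynamic_items, cache) for k in takes]
        cache[key] = func(*args)
    return cache[key]

items = {
    "foo": (operator.add, ["a", "b"]),
    "bar": (operator.sub, ["foo", "b"]),
}
result = compute("bar", {"a": 1, "b": 2}, items, {})  # (1 + 2) - 2 = 1
```

This mirrors the module-level example above: "foo" is computed from the static entries "a" and "b", and "bar" is computed from "foo" and "b".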
- set_output_keys(keys)[source]¶
Use this to change the output keys.
Also re-evaluates the execution order, so if you request different outputs, some parts of the data pipeline may be skipped.
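Why changing the output keys can skip work can be sketched as a dependency walk: only the transitive dependencies of the requested keys need to run. This is a simplified illustration with a hypothetical item registry, not speechbrain's implementation:

```python
# Hedged sketch: collect the set of dynamic items needed for the
# requested output keys by walking their "takes" dependencies.
# Anything outside this set can be skipped.
def needed_keys(output_keys, dynamic_items, static_keys):
    needed = set()
    stack = list(output_keys)
    while stack:
        key = stack.pop()
        if key in needed or key in static_keys:
            continue
        needed.add(key)
        stack.extend(dynamic_items[key])  # this item's "takes" keys
    return needed

# "bar" depends on "foo", but requesting only "foo" never computes "bar":
items = {"foo": ["a", "b"], "bar": ["foo", "b"]}
result = needed_keys(["foo"], items, {"a", "b"})  # {'foo'}
```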