speechbrain.dataio.dataio module
Data reading and writing.
- Authors
Mirco Ravanelli 2020
Aku Rouhe 2020
Ju-Chieh Chou 2020
Samuele Cornell 2020
Abdel HEBA 2020
Gaelle Laperriere 2021
Sahar Ghannay 2021
Sylvain de Langen 2022
Summary
Classes:
IterativeCSVWriter – Write CSV files a line at a time.
Functions:
append_eos_token – Create labels with <eos> token appended.
clean_padding – Sets the value of any padding on the specified tensor to mask_value (out-of-place).
clean_padding_ – Sets the value of any padding on the specified tensor to mask_value (in-place).
convert_index_to_lab – Convert a batch of integer IDs to string labels.
extract_concepts_values – Keep the semantic concepts and values for evaluation.
get_md5 – Get the md5 checksum of an input file.
length_to_mask – Creates a binary mask for each sequence.
load_data_csv – Loads CSV and formats string values.
load_data_json – Loads JSON and recursively formats string values.
load_pickle – Utility function for loading .pkl pickle files.
load_pkl – Loads a pkl file.
merge_char – Merge character sequences into word sequences.
merge_csvs – Merge several csv files into one file.
prepend_bos_token – Create labels with <bos> token at the beginning.
read_audio – General audio loading, based on a custom notation.
read_audio_info – Retrieves audio metadata from a file path.
read_audio_multichannel – General audio loading, based on a custom notation.
read_kaldi_lab – Read labels in kaldi format.
relative_time_to_absolute – Converts SpeechBrain style relative length to the absolute duration.
save_md5 – Saves the md5 of a list of input files as a pickled dict into a file.
save_pkl – Save an object in pkl format.
split_word – Split word sequences into character sequences.
to_doubleTensor – Converts the input into a torch.DoubleTensor.
to_floatTensor – Converts the input into a torch.FloatTensor.
to_longTensor – Converts the input into a torch.LongTensor.
write_audio – Write audio on disk.
write_stdout – Write data to standard output.
write_txt_file – Write data in text format.
Reference
- speechbrain.dataio.dataio.load_data_json(json_path, replacements={})[source]
Loads JSON and recursively formats string values.
- Parameters:
json_path (str) – Path to the json file.
replacements (dict) – Optional string replacements to apply, e.g., {"ROOT": "/home"} as in the example below.
- Returns:
JSON data with replacements applied.
- Return type:
dict
Example
>>> json_spec = '''{
...     "ex1": {"files": ["{ROOT}/mic1/ex1.wav", "{ROOT}/mic2/ex1.wav"], "id": 1},
...     "ex2": {"files": [{"spk1": "{ROOT}/ex2.wav"}, {"spk2": "{ROOT}/ex2.wav"}], "id": 2}
... }
... '''
>>> tmpfile = getfixture('tmpdir') / "test.json"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(json_spec)
>>> data = load_data_json(tmpfile, {"ROOT": "/home"})
>>> data["ex1"]["files"][0]
'/home/mic1/ex1.wav'
>>> data["ex2"]["files"][1]["spk2"]
'/home/ex2.wav'
- speechbrain.dataio.dataio.load_data_csv(csv_path, replacements={})[source]
Loads CSV and formats string values.
Uses the SpeechBrain legacy CSV data format, where the CSV must have an ‘ID’ field. If there is a field called duration, it is interpreted as a float. The rest of the fields are left as they are (legacy _format and _opts fields are not used to load the data in any special way).
Bash-like string replacements with $to_replace are supported.
- Parameters:
csv_path (str) – Path to the csv file.
replacements (dict) – Optional string replacements to apply, e.g., replacing $data_folder as in the example below.
- Returns:
CSV data with replacements applied.
- Return type:
dict
Example
>>> csv_spec = '''ID,duration,wav_path
... utt1,1.45,$data_folder/utt1.wav
... utt2,2.0,$data_folder/utt2.wav
... '''
>>> tmpfile = getfixture("tmpdir") / "test.csv"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(csv_spec)
>>> data = load_data_csv(tmpfile, {"data_folder": "/home"})
>>> data["utt1"]["wav_path"]
'/home/utt1.wav'
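The $key replacement behavior can be illustrated with plain Python. This is a sketch of the idea only; the helper name apply_replacements is hypothetical, and the actual SpeechBrain implementation may differ in details:

```python
# Sketch of the bash-like $key substitution applied by load_data_csv.
# apply_replacements is a hypothetical helper, not part of SpeechBrain.

def apply_replacements(value: str, replacements: dict) -> str:
    """Replace each occurrence of $key in value with replacements[key]."""
    for key, target in replacements.items():
        value = value.replace("$" + key, target)
    return value

# One parsed CSV row, before substitution:
row = {"ID": "utt1", "duration": "1.45", "wav_path": "$data_folder/utt1.wav"}
resolved = {k: apply_replacements(v, {"data_folder": "/home"})
            for k, v in row.items()}
print(resolved["wav_path"])  # /home/utt1.wav
```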
- speechbrain.dataio.dataio.read_audio_info(path) AudioMetaData [source]
Retrieves audio metadata from a file path. Behaves identically to torchaudio.info, but attempts to fix metadata (such as frame count) that is otherwise broken with certain torchaudio version and codec combinations.
Note that this may cause full file traversal in certain cases!
- Parameters:
path (str) – Path to the audio file to examine.
- Returns:
Same value as returned by torchaudio.info, but num_frames may have been corrected if it would otherwise have been == 0.
- Return type:
torchaudio.backend.common.AudioMetaData
Note
Some codecs, such as MP3, require full file traversal for accurate length information to be retrieved. In these cases, you may as well read the entire audio file to avoid doubling the processing time.
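For uncompressed WAV files, the frame count can be read cheaply from the container header; Python's standard-library wave module illustrates the idea (a simplified sketch only — read_audio_info itself goes through torchaudio and also handles compressed codecs like MP3, where the header alone is not enough):

```python
import wave

def wav_num_frames(path: str) -> int:
    """Read the frame count stored in a WAV header (no sample decoding)."""
    with wave.open(path, "rb") as f:
        return f.getnframes()

# Write one second of 16 kHz mono silence, then read its metadata back.
with wave.open("dummy.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)        # 16-bit samples
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 16000)

print(wav_num_frames("dummy.wav"))  # 16000
```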
- speechbrain.dataio.dataio.read_audio(waveforms_obj)[source]
General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets specified by JSON.
The parameter may just be a path to a file: read_audio("/path/to/wav1.wav")

Alternatively, you can specify more options in a dict, e.g.:

```
# load a file from sample 8000 through 15999
read_audio({
    "file": "/path/to/wav2.wav",
    "start": 8000,
    "stop": 16000,
})
```
Which codecs are supported depends on your torchaudio backend. Refer to torchaudio.load documentation for further details.
- Parameters:
waveforms_obj (str, dict) – Path to audio or dict with the desired configuration.
Keys for the dict variant:
- "file" (str): Path to the audio file.
- "start" (int, optional): The first sample to load. If unspecified, load from the very first frame.
- "stop" (int, optional): The last sample to load (exclusive). If unspecified or equal to start, load from start to the end. Will not fail if stop is past the sample count of the file; it will simply return fewer frames.
- Returns:
1-channel: audio tensor with shape (samples,). >=2-channels: audio tensor with shape (samples, channels).
- Return type:
torch.Tensor
Example
>>> dummywav = torch.rand(16000)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0), atol=1e-4)  # replace with eq with sox_io backend
True
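The start/stop semantics described above follow Python slicing: a stop past the end of the file simply yields fewer samples instead of raising. A plain-Python sketch of that behavior (slice_samples is a hypothetical helper, not the actual implementation, which slices the decoded tensor):

```python
def slice_samples(samples, start=0, stop=None):
    """Mimic read_audio's segment selection: stop is exclusive, and a stop
    past the end of the data returns fewer samples instead of failing.
    If stop is unspecified or equal to start, read from start to the end."""
    if stop is None or stop == start:
        return samples[start:]
    return samples[start:stop]

data = list(range(16000))                       # stand-in for decoded samples
print(len(slice_samples(data, 8000, 16000)))    # 8000 samples: 8000..15999
print(len(slice_samples(data, 8000, 999999)))   # also 8000: stop past the end
```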
- speechbrain.dataio.dataio.read_audio_multichannel(waveforms_obj)[source]
General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets specified by JSON.
The custom notation:

The annotation can be just a path to a file: "/path/to/wav1.wav"

Multiple (possibly multi-channel) files can be specified, as long as they have the same length:

{"files": [
    "/path/to/wav1.wav",
    "/path/to/wav2.wav"
]}

Or you can specify a single file more succinctly: {"files": "/path/to/wav2.wav"}

Start and stop sample counts can also be specified, to read only a segment within the files:

{"files": [
    "/path/to/wav1.wav",
    "/path/to/wav2.wav"
],
 "start": 8000,
 "stop": 16000}
- Parameters:
waveforms_obj (str, dict) – Audio reading annotation, see above for format.
- Returns:
Audio tensor with shape: (samples, ).
- Return type:
torch.Tensor
Example
>>> dummywav = torch.rand(16000, 2)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0), atol=1e-4)  # replace with eq with sox_io backend
True
- speechbrain.dataio.dataio.write_audio(filepath, audio, samplerate)[source]
Write audio on disk. It is basically a wrapper to support saving audio signals in the speechbrain format (audio, channels).
- Parameters:
filepath (path) – Path where to save the audio file.
audio (torch.Tensor) – Audio file in the expected speechbrain format (signal, channels).
samplerate (int) – Sample rate (e.g., 16000).
Example
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> dummywav = torch.rand(16000, 2)
>>> write_audio(tmpfile, dummywav, 16000)
>>> loaded = read_audio(tmpfile)
>>> loaded.allclose(dummywav, atol=1e-4)  # replace with eq with sox_io backend
True
- speechbrain.dataio.dataio.load_pickle(pickle_path)[source]
Utility function for loading .pkl pickle files.
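The file format these pickle helpers operate on is the standard one; a minimal round trip with Python's pickle module sketches the behavior (the SpeechBrain helpers add path handling and error reporting around this):

```python
import pickle

obj = {"ids": ["utt1", "utt2"], "durations": [1.45, 2.0]}

# Save an object in .pkl format...
with open("example.pkl", "wb") as f:
    pickle.dump(obj, f)

# ...and load it back.
with open("example.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == obj)  # True
```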
- speechbrain.dataio.dataio.to_floatTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Converts the input into a torch.FloatTensor.
- speechbrain.dataio.dataio.to_doubleTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Converts the input into a torch.DoubleTensor.
- speechbrain.dataio.dataio.to_longTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Converts the input into a torch.LongTensor.
- speechbrain.dataio.dataio.convert_index_to_lab(batch, ind2lab)[source]
Convert a batch of integer IDs to string labels.
- Parameters:
batch (list) – List of lists of integer indices, one list per sequence in the batch.
ind2lab (dict) – Mapping from integer index to string label.
- Returns:
List of lists, same size as batch, with labels from ind2lab.
- Return type:
list
Example
>>> ind2lab = {1: "h", 2: "e", 3: "l", 4: "o"}
>>> out = convert_index_to_lab([[4,1], [1,2,3,3,4]], ind2lab)
>>> for seq in out:
...     print("".join(seq))
oh
hello
- speechbrain.dataio.dataio.relative_time_to_absolute(batch, relative_lens, rate)[source]
Converts SpeechBrain style relative length to the absolute duration.
Operates on batch level.
- Parameters:
batch (torch.tensor) – Sequences to determine the duration for.
relative_lens (torch.tensor) – The relative length of each sequence in batch. The longest sequence in the batch needs to have relative length 1.0.
rate (float) – The rate at which sequence elements occur in real-world time. Sample rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is features. This has to have 1/s as the unit.
- Returns:
Duration of each sequence in seconds.
- Return type:
torch.tensor
Example
>>> batch = torch.ones(2, 16000)
>>> relative_lens = torch.tensor([3./4., 1.0])
>>> rate = 16000
>>> print(relative_time_to_absolute(batch, relative_lens, rate))
tensor([0.7500, 1.0000])
- class speechbrain.dataio.dataio.IterativeCSVWriter(outstream, data_fields, defaults={})[source]
Bases:
object
Write CSV files a line at a time.
- Parameters:
outstream (file-object) – A writeable stream.
data_fields (list) – List of the optional keys to write. Each key will be expanded to the SpeechBrain format, producing three fields: key, key_format, key_opts.
defaults (dict) – Mapping from CSV field to default value (optional).
Example
>>> import io
>>> f = io.StringIO()
>>> writer = IterativeCSVWriter(f, ["phn"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
>>> writer.write("UTT1",2.5,"sil hh ee ll ll oo sil","string","")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
>>> writer.write(ID="UTT2",phn="sil ww oo rr ll dd sil",phn_format="string")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
>>> writer.set_default('phn_format', 'string')
>>> writer.write_batch(ID=["UTT3","UTT4"],phn=["ff oo oo", "bb aa rr"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
UTT3,,ff oo oo,string,
UTT4,,bb aa rr,string,
- set_default(field, value)[source]
Sets a default value for the given CSV field.
- Parameters:
field (str) – A field in the CSV.
value – The default value.
- write(*args, **kwargs)[source]
Writes one data line into the CSV.
- Parameters:
*args – Supply a value for every field, in positional form (use either this or **kwargs).
**kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.
- write_batch(*args, **kwargs)[source]
Writes a batch of lines into the CSV.
Here each argument should be a list with the same length.
- Parameters:
*args – Supply a value for every field, in positional form (use either this or **kwargs).
**kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.
- speechbrain.dataio.dataio.write_txt_file(data, filename, sampling_rate=None)[source]
Write data in text format.
- Parameters:
data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.
filename (str) – Path to file where to write the data.
sampling_rate (None) – Not used, just here for interface compatibility.
- Return type:
None
Example
>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([1,2,3,4])
>>> write_txt_file(signal, tmpdir / 'example.txt')
- speechbrain.dataio.dataio.write_stdout(data, filename=None, sampling_rate=None)[source]
Write data to standard output.
- Parameters:
data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.
filename (None) – Not used, just here for compatibility.
sampling_rate (None) – Not used, just here for compatibility.
- Return type:
None
Example
>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([[1,2,3,4]])
>>> write_stdout(signal, tmpdir / 'example.txt')
[1, 2, 3, 4]
- speechbrain.dataio.dataio.length_to_mask(length, max_len=None, dtype=None, device=None)[source]
Creates a binary mask for each sequence.
Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3
- Parameters:
length (torch.LongTensor) – Containing the length of each sequence in the batch. Must be 1D.
max_len (int) – Max length for the mask, also the size of the second dimension.
dtype (torch.dtype, default: None) – The dtype of the generated mask.
device (torch.device, default: None) – The device to put the mask variable.
- Returns:
mask – The binary mask.
- Return type:
tensor
Example
>>> length = torch.Tensor([1,2,3])
>>> mask = length_to_mask(length)
>>> mask
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
- speechbrain.dataio.dataio.read_kaldi_lab(kaldi_ali, kaldi_lab_opts)[source]
Read labels in kaldi format.
Uses kaldi IO.
- Parameters:
kaldi_ali (str) – Path to the directory where the kaldi alignments are stored.
kaldi_lab_opts (str) – A string that contains the options for reading the kaldi alignments (e.g., 'ali-to-pdf').
- Returns:
lab – A dictionary containing the labels.
- Return type:
dict
Note
This depends on kaldi-io-for-python. Install it separately. See: https://github.com/vesis84/kaldi-io-for-python
Example
This example requires kaldi files.
lab_folder = '/home/kaldi/egs/TIMIT/s5/exp/dnn4_pretrain-dbn_dnn_ali'
read_kaldi_lab(lab_folder, 'ali-to-pdf')
- speechbrain.dataio.dataio.get_md5(file)[source]
Get the md5 checksum of an input file.
- Parameters:
file (str) – Path to the file for which to compute the checksum.
- Returns:
Checksum for the given filepath.
- Return type:
str
Example
>>> get_md5('tests/samples/single-mic/example1.wav')
'c482d0081ca35302d30d12f1136c34e5'
- speechbrain.dataio.dataio.save_md5(files, out_file)[source]
Saves the md5 of a list of input files as a pickled dict into a file.
- Parameters:
files (list) – List of input file paths.
out_file (str) – Path to the output pkl file.
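What get_md5 and save_md5 amount to can be sketched with only the standard library (md5_of_file and save_checksums are hypothetical names for illustration; the real helpers are get_md5 and save_md5 above):

```python
import hashlib
import pickle

def md5_of_file(path: str) -> str:
    """Compute an md5 checksum, reading the file in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def save_checksums(files, out_file):
    """Pickle a {path: md5} dict, as save_md5 does."""
    with open(out_file, "wb") as f:
        pickle.dump({path: md5_of_file(path) for path in files}, f)

# Example: checksum a small file created on the spot.
with open("data.txt", "wb") as f:
    _ = f.write(b"hello")
save_checksums(["data.txt"], "checksums.pkl")
with open("checksums.pkl", "rb") as f:
    print(pickle.load(f)["data.txt"])  # 5d41402abc4b2a76b9719d911017c592
```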
- speechbrain.dataio.dataio.save_pkl(obj, file)[source]
Save an object in pkl format.
- Parameters:
obj (object) – The object to save in pkl format.
file (str) – Path to the output file.
Example
>>> tmpfile = getfixture('tmpdir') / "example.pkl"
>>> save_pkl([1, 2, 3, 4, 5], tmpfile)
>>> load_pkl(tmpfile)
[1, 2, 3, 4, 5]
- speechbrain.dataio.dataio.load_pkl(file)[source]
Loads a pkl file.
For an example, see save_pkl.
- Parameters:
file (str) – Path to the input pkl file.
- Return type:
The loaded object.
- speechbrain.dataio.dataio.prepend_bos_token(label, bos_index)[source]
Create labels with <bos> token at the beginning.
- Parameters:
label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length].
bos_index (int) – The index for <bos> token.
- Returns:
new_label – The new label with <bos> at the beginning.
- Return type:
tensor
Example
>>> label = torch.LongTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> new_label = prepend_bos_token(label, bos_index=7)
>>> new_label
tensor([[7, 1, 0, 0],
        [7, 2, 3, 0],
        [7, 4, 5, 6]])
- speechbrain.dataio.dataio.append_eos_token(label, length, eos_index)[source]
Create labels with <eos> token appended.
- Parameters:
label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length]
length (torch.LongTensor) – Containing the original length of each label sequence. Must be 1D.
eos_index (int) – The index for <eos> token.
- Returns:
new_label – The new label with <eos> appended.
- Return type:
tensor
Example
>>> label = torch.IntTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> length = torch.LongTensor([1,2,3])
>>> new_label = append_eos_token(label, length, eos_index=7)
>>> new_label
tensor([[1, 7, 0, 0],
        [2, 3, 7, 0],
        [4, 5, 6, 7]], dtype=torch.int32)
- speechbrain.dataio.dataio.merge_char(sequences, space='_')[source]
Merge character sequences into word sequences.
- Parameters:
sequences (list) – Each item contains a list, and this list contains a character sequence.
space (string) – The token that represents a space. Default: _
- Return type:
The list contains word sequences for each sentence.
Example
>>> sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
>>> results = merge_char(sequences)
>>> results
[['ab', 'c', 'de'], ['efg', 'hi']]
- speechbrain.dataio.dataio.merge_csvs(data_folder, csv_lst, merged_csv)[source]
Merge several csv files into one file.
- Parameters:
data_folder (string) – The folder where the csv files to be merged are stored; the merged csv is written there as well.
csv_lst (list) – Filenames of csv file to be merged.
merged_csv (string) – The filename to write the merged csv file.
Example
>>> tmpdir = getfixture('tmpdir')
>>> os.symlink(os.path.realpath("tests/samples/annotation/speech.csv"), tmpdir / "speech.csv")
>>> merge_csvs(tmpdir,
...     ["speech.csv", "speech.csv"],
...     "test_csv_merge.csv")
- speechbrain.dataio.dataio.split_word(sequences, space='_')[source]
Split word sequences into character sequences.
- Parameters:
sequences (list) – Each item contains a list, and this list contains a word sequence.
space (string) – The token that represents a space. Default: _
- Return type:
The list contains the character sequences for each sentence.
Example
>>> sequences = [['ab', 'c', 'de'], ['efg', 'hi']]
>>> results = split_word(sequences)
>>> results
[['a', 'b', '_', 'c', '_', 'd', 'e'], ['e', 'f', 'g', '_', 'h', 'i']]
- speechbrain.dataio.dataio.clean_padding_(tensor, length, len_dim=1, mask_value=0.0)[source]
Sets the value of any padding on the specified tensor to mask_value.
For instance, this can be used to zero out the outputs of an autoencoder during training past the specified length.
This is an in-place operation.
- Parameters:
tensor (torch.Tensor) – a tensor of arbitrary dimension
length (torch.Tensor) – a 1-D tensor of lengths
len_dim (int) – the dimension representing the length
mask_value (mixed) – the value to be assigned to padding positions
Example
>>> import torch
>>> x = torch.arange(5).unsqueeze(0).repeat(3, 1)
>>> x = x + torch.arange(3).unsqueeze(-1)
>>> x
tensor([[0, 1, 2, 3, 4],
        [1, 2, 3, 4, 5],
        [2, 3, 4, 5, 6]])
>>> length = torch.tensor([0.4, 1.0, 0.6])
>>> clean_padding_(x, length=length, mask_value=10.)
>>> x
tensor([[ 0,  1, 10, 10, 10],
        [ 1,  2,  3,  4,  5],
        [ 2,  3,  4, 10, 10]])
>>> x = torch.arange(5)[None, :, None].repeat(3, 1, 2)
>>> x = x + torch.arange(3)[:, None, None]
>>> x = x * torch.arange(1, 3)[None, None, :]
>>> x = x.transpose(1, 2)
>>> x
tensor([[[ 0,  1,  2,  3,  4],
         [ 0,  2,  4,  6,  8]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4,  5,  6],
         [ 4,  6,  8, 10, 12]]])
>>> clean_padding_(x, length=length, mask_value=10., len_dim=2)
>>> x
tensor([[[ 0,  1, 10, 10, 10],
         [ 0,  2, 10, 10, 10]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4, 10, 10],
         [ 4,  6,  8, 10, 10]]])
- speechbrain.dataio.dataio.clean_padding(tensor, length, len_dim=1, mask_value=0.0)[source]
Sets the value of any padding on the specified tensor to mask_value.
For instance, this can be used to zero out the outputs of an autoencoder during training past the specified length.
This version of the operation does not modify the original tensor.
- Parameters:
tensor (torch.Tensor) – a tensor of arbitrary dimension
length (torch.Tensor) – a 1-D tensor of lengths
len_dim (int) – the dimension representing the length
mask_value (mixed) – the value to be assigned to padding positions
Example
>>> import torch
>>> x = torch.arange(5).unsqueeze(0).repeat(3, 1)
>>> x = x + torch.arange(3).unsqueeze(-1)
>>> x
tensor([[0, 1, 2, 3, 4],
        [1, 2, 3, 4, 5],
        [2, 3, 4, 5, 6]])
>>> length = torch.tensor([0.4, 1.0, 0.6])
>>> x_p = clean_padding(x, length=length, mask_value=10.)
>>> x_p
tensor([[ 0,  1, 10, 10, 10],
        [ 1,  2,  3,  4,  5],
        [ 2,  3,  4, 10, 10]])
>>> x = torch.arange(5)[None, :, None].repeat(3, 1, 2)
>>> x = x + torch.arange(3)[:, None, None]
>>> x = x * torch.arange(1, 3)[None, None, :]
>>> x = x.transpose(1, 2)
>>> x
tensor([[[ 0,  1,  2,  3,  4],
         [ 0,  2,  4,  6,  8]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4,  5,  6],
         [ 4,  6,  8, 10, 12]]])
>>> x_p = clean_padding(x, length=length, mask_value=10., len_dim=2)
>>> x_p
tensor([[[ 0,  1, 10, 10, 10],
         [ 0,  2, 10, 10, 10]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4, 10, 10],
         [ 4,  6,  8, 10, 10]]])
- speechbrain.dataio.dataio.extract_concepts_values(sequences, keep_values, tag_in, tag_out, space)[source]
Keep the semantic concepts and values for evaluation.
- Parameters:
sequences (list) – Each item contains a list, and this list contains a character sequence.
keep_values (bool) – If True, keep the values; if False, keep only the concepts.
tag_in (char) – Indicates the start of the concept.
tag_out (char) – Indicates the end of the concept.
space (string) – The token that represents a space. Default: _
- Return type:
The list contains concept and value sequences for each sentence.
Example
>>> sequences = [['<reponse>','_','n','o','_','>','_','<localisation-ville>','_','L','e','_','M','a','n','s','_','>'], ['<reponse>','_','s','i','_','>'],['v','a','_','b','e','n','e']] >>> results = extract_concepts_values(sequences, True, '<', '>', '_') >>> results [['<reponse> no', '<localisation-ville> Le Mans'], ['<reponse> si'], ['']]