speechbrain.dataio.dataio module

Data reading and writing.

Authors
  • Mirco Ravanelli 2020

  • Aku Rouhe 2020

  • Ju-Chieh Chou 2020

  • Samuele Cornell 2020

  • Abdel HEBA 2020

Summary

Classes:

IterativeCSVWriter

Write CSV files a line at a time.

Functions:

append_eos_token

Create labels with <eos> token appended.

convert_index_to_lab

Convert a batch of integer IDs to string labels.

get_md5

Get the md5 checksum of an input file.

length_to_mask

Creates a binary mask for each sequence.

load_data_csv

Loads CSV and formats string values.

load_data_json

Loads JSON and recursively formats string values.

load_pickle

Utility function for loading .pkl pickle files.

load_pkl

Loads a pkl file.

merge_char

Merge character sequences into word sequences.

merge_csvs

Merge several CSV files into one.

prepend_bos_token

Create labels with <bos> token at the beginning.

read_audio

General audio loading, based on a custom notation.

read_audio_multichannel

General audio loading, based on a custom notation.

read_kaldi_lab

Read labels in kaldi format.

relative_time_to_absolute

Converts SpeechBrain style relative length to the absolute duration.

save_md5

Saves the md5 of a list of input files as a pickled dict into a file.

save_pkl

Save an object in pkl format.

split_word

Split word sequences into character sequences.

to_doubleTensor

Converts input data (list, tuple, or np.ndarray) to a torch double tensor.

to_floatTensor

Converts input data (list, tuple, or np.ndarray) to a torch float tensor.

to_longTensor

Converts input data (list, tuple, or np.ndarray) to a torch long tensor.

write_audio

Write audio on disk.

write_stdout

Write data to standard output.

write_txt_file

Write data in text format.

Reference

speechbrain.dataio.dataio.load_data_json(json_path, replacements={})[source]

Loads JSON and recursively formats string values.

Parameters
  • json_path (str) – Path to JSON file.

  • replacements (dict) – Optional dict, e.g., {"data_folder": "/home/speechbrain/data"}. This is used to recursively format all string values in the data.

Returns

JSON data with replacements applied.

Return type

dict

Example

>>> json_spec = '''{
...   "ex1": {"files": ["{ROOT}/mic1/ex1.wav", "{ROOT}/mic2/ex1.wav"], "id": 1},
...   "ex2": {"files": [{"spk1": "{ROOT}/ex2.wav"}, {"spk2": "{ROOT}/ex2.wav"}], "id": 2}
... }
... '''
>>> tmpfile = getfixture('tmpdir') / "test.json"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(json_spec)
>>> data = load_data_json(tmpfile, {"ROOT": "/home"})
>>> data["ex1"]["files"][0]
'/home/mic1/ex1.wav'
>>> data["ex2"]["files"][1]["spk2"]
'/home/ex2.wav'
speechbrain.dataio.dataio.load_data_csv(csv_path, replacements={})[source]

Loads CSV and formats string values.

Uses the SpeechBrain legacy CSV data format, where the CSV must have an 'ID' field. If there is a field called 'duration', it is interpreted as a float. The rest of the fields are left as they are (legacy _format and _opts fields are not used to load the data in any special way).

Bash-like string replacements with $to_replace are supported.

Parameters
  • csv_path (str) – Path to CSV file.

  • replacements (dict) – Optional dict, e.g., {"data_folder": "/home/speechbrain/data"}. This is used to recursively format all string values in the data.

Returns

CSV data with replacements applied.

Return type

dict

Example

>>> csv_spec = '''ID,duration,wav_path
... utt1,1.45,$data_folder/utt1.wav
... utt2,2.0,$data_folder/utt2.wav
... '''
>>> tmpfile = getfixture("tmpdir") / "test.csv"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(csv_spec)
>>> data = load_data_csv(tmpfile, {"data_folder": "/home"})
>>> data["utt1"]["wav_path"]
'/home/utt1.wav'
speechbrain.dataio.dataio.read_audio(waveforms_obj)[source]

General audio loading, based on a custom notation.

Expected use case is in conjunction with Datasets specified by JSON.

The custom notation:

The annotation can be just a path to a file: "/path/to/wav1.wav"

Or it can specify more options in a dict: {"file": "/path/to/wav2.wav", "start": 8000, "stop": 16000}

Parameters

waveforms_obj (str, dict) – Audio reading annotation, see above for format.

Returns

Audio tensor with shape: (samples, ).

Return type

torch.Tensor

Example

>>> dummywav = torch.rand(16000)
>>> import os
>>> tmpfile = os.path.join(str(getfixture('tmpdir')),  "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0),atol=1e-4) # replace with eq with sox_io backend
True
speechbrain.dataio.dataio.read_audio_multichannel(waveforms_obj)[source]

General audio loading, based on a custom notation.

Expected use case is in conjunction with Datasets specified by JSON.

The custom notation:

The annotation can be just a path to a file: "/path/to/wav1.wav"

Multiple (possibly multi-channel) files can be specified, as long as they have the same length: {"files": ["/path/to/wav1.wav", "/path/to/wav2.wav"]}

Or you can specify a single file more succinctly: {"files": "/path/to/wav2.wav"}

Start and stop sample indices can also be specified, to read only a segment within the files: {"files": ["/path/to/wav1.wav", "/path/to/wav2.wav"], "start": 8000, "stop": 16000}

Parameters

waveforms_obj (str, dict) – Audio reading annotation, see above for format.

Returns

Audio tensor with shape: (samples, num_channels).

Return type

torch.Tensor

Example

>>> dummywav = torch.rand(16000, 2)
>>> import os
>>> tmpfile = os.path.join(str(getfixture('tmpdir')),  "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0),atol=1e-4) # replace with eq with sox_io backend
True
speechbrain.dataio.dataio.write_audio(filepath, audio, samplerate)[source]

Write audio on disk. It is essentially a wrapper that supports saving audio signals in the SpeechBrain format (samples, channels).

Parameters
  • filepath (path) – Path where to save the audio file.

  • audio (torch.Tensor) – Audio file in the expected SpeechBrain format (samples, channels).

  • samplerate (int) – Sample rate (e.g., 16000).

Example

>>> import os
>>> tmpfile = os.path.join(str(getfixture('tmpdir')),  "wave.wav")
>>> dummywav = torch.rand(16000, 2)
>>> write_audio(tmpfile, dummywav, 16000)
>>> loaded = read_audio(tmpfile)
>>> loaded.allclose(dummywav,atol=1e-4) # replace with eq with sox_io backend
True
speechbrain.dataio.dataio.load_pickle(pickle_path)[source]

Utility function for loading .pkl pickle files.

Parameters

pickle_path (str) – Path to pickle file.

Returns

out – Python object loaded from pickle.

Return type

object
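No doctest is given for load_pickle; a minimal stdlib sketch of the presumable behavior (the helper name load_pickle_sketch is illustrative, and the real implementation may add extra handling):

```python
import os
import pickle
import tempfile

def load_pickle_sketch(pickle_path):
    # Open in binary mode and unpickle a single object.
    with open(pickle_path, "rb") as f:
        return pickle.load(f)

# Round-trip check: write a pickle, then load it back.
tmp = os.path.join(tempfile.mkdtemp(), "obj.pkl")
with open(tmp, "wb") as f:
    pickle.dump({"a": 1}, f)

print(load_pickle_sketch(tmp))  # {'a': 1}
```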

speechbrain.dataio.dataio.to_floatTensor(x)[source]

Converts input data to a torch float tensor.

Parameters

x ((list, tuple, np.ndarray)) – Input data to be converted to torch float.

Returns

tensor – Data now in torch.tensor float datatype.

Return type

torch.tensor

speechbrain.dataio.dataio.to_doubleTensor(x)[source]

Converts input data to a torch double tensor.

Parameters

x ((list, tuple, np.ndarray)) – Input data to be converted to torch double.

Returns

tensor – Data now in torch.tensor double datatype.

Return type

torch.tensor

speechbrain.dataio.dataio.to_longTensor(x)[source]

Converts input data to a torch long tensor.

Parameters

x ((list, tuple, np.ndarray)) – Input data to be converted to torch long.

Returns

tensor – Data now in torch.tensor long datatype.

Return type

torch.tensor
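The three converters above differ only in the target dtype; a hedged sketch of the shared pattern, assuming a plain torch.as_tensor conversion (which may differ from the actual implementation):

```python
import numpy as np
import torch

def to_tensor_sketch(x, dtype):
    # Accept a list, tuple, or np.ndarray and return a torch tensor
    # of the requested dtype.
    return torch.as_tensor(x).to(dtype)

print(to_tensor_sketch([1, 2, 3], torch.float32).dtype)     # torch.float32
print(to_tensor_sketch((1.5, 2.5), torch.float64).dtype)    # torch.float64
print(to_tensor_sketch(np.array([1, 2]), torch.long).dtype)  # torch.int64
```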

speechbrain.dataio.dataio.convert_index_to_lab(batch, ind2lab)[source]

Convert a batch of integer IDs to string labels.

Parameters
  • batch (list) – List of lists, a batch of sequences.

  • ind2lab (dict) – Mapping from integer IDs to labels.

Returns

List of lists, same size as batch, with labels from ind2lab.

Return type

list

Example

>>> ind2lab = {1: "h", 2: "e", 3: "l", 4: "o"}
>>> out = convert_index_to_lab([[4,1], [1,2,3,3,4]], ind2lab)
>>> for seq in out:
...     print("".join(seq))
oh
hello
speechbrain.dataio.dataio.relative_time_to_absolute(batch, relative_lens, rate)[source]

Converts SpeechBrain style relative length to the absolute duration.

Operates on batch level.

Parameters
  • batch (torch.tensor) – Sequences to determine the duration for.

  • relative_lens (torch.tensor) – The relative length of each sequence in batch. The longest sequence in the batch needs to have relative length 1.0.

  • rate (float) – The rate at which sequence elements occur in real-world time. Sample rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is features. This has to have 1/s as the unit.

Returns

Duration of each sequence in seconds.

Return type

torch.tensor

Example

>>> batch = torch.ones(2, 16000)
>>> relative_lens = torch.tensor([3./4., 1.0])
>>> rate = 16000
>>> print(relative_time_to_absolute(batch, relative_lens, rate))
tensor([0.7500, 1.0000])
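The doctest numbers follow from duration = relative_len * max_len / rate, where max_len is the longest sequence length in samples; a quick stdlib check:

```python
# duration_seconds = relative_len * max_len_samples / rate
max_len = 16000   # longest sequence in the batch, in samples
rate = 16000      # unit 1/s (the sample rate here)

durations = [rel * max_len / rate for rel in (3.0 / 4.0, 1.0)]
print(durations)  # [0.75, 1.0]
```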
class speechbrain.dataio.dataio.IterativeCSVWriter(outstream, data_fields, defaults={})[source]

Bases: object

Write CSV files a line at a time.

Parameters
  • outstream (file-object) – A writeable stream

  • data_fields (list) – List of the optional keys to write. Each key will be expanded to the SpeechBrain format, producing three fields: key, key_format, key_opts.

Example

>>> import io
>>> f = io.StringIO()
>>> writer = IterativeCSVWriter(f, ["phn"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
>>> writer.write("UTT1",2.5,"sil hh ee ll ll oo sil","string","")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
>>> writer.write(ID="UTT2",phn="sil ww oo rr ll dd sil",phn_format="string")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
>>> writer.set_default('phn_format', 'string')
>>> writer.write_batch(ID=["UTT3","UTT4"],phn=["ff oo oo", "bb aa rr"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
UTT3,,ff oo oo,string,
UTT4,,bb aa rr,string,
set_default(field, value)[source]

Sets a default value for the given CSV field.

Parameters
  • field (str) – A field in the CSV.

  • value – The default value.

write(*args, **kwargs)[source]

Writes one data line into the CSV.

Parameters
  • *args – Supply a value for every field in positional form; use either this or **kwargs.

  • **kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.

write_batch(*args, **kwargs)[source]

Writes a batch of lines into the CSV.

Here each argument should be a list with the same length.

Parameters
  • *args – Supply a value for every field in positional form; use either this or **kwargs.

  • **kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.

speechbrain.dataio.dataio.write_txt_file(data, filename, sampling_rate=None)[source]

Write data in text format.

Parameters
  • data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.

  • filename (str) – Path to file where to write the data.

  • sampling_rate (None) – Not used, just here for interface compatibility.

Return type

None

Example

>>> tmpdir = getfixture('tmpdir')
>>> signal=torch.tensor([1,2,3,4])
>>> write_txt_file(signal, os.path.join(tmpdir, 'example.txt'))
speechbrain.dataio.dataio.write_stdout(data, filename=None, sampling_rate=None)[source]

Write data to standard output.

Parameters
  • data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.

  • filename (None) – Not used, just here for compatibility.

  • sampling_rate (None) – Not used, just here for compatibility.

Return type

None

Example

>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([[1,2,3,4]])
>>> write_stdout(signal, tmpdir + '/example.txt')
[1, 2, 3, 4]
speechbrain.dataio.dataio.length_to_mask(length, max_len=None, dtype=None, device=None)[source]

Creates a binary mask for each sequence.

Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3

Parameters
  • length (torch.LongTensor) – Containing the length of each sequence in the batch. Must be 1D.

  • max_len (int) – Max length for the mask, also the size of the second dimension.

  • dtype (torch.dtype, default: None) – The dtype of the generated mask.

  • device (torch.device, default: None) – The device to put the mask variable.

Returns

mask – The binary mask.

Return type

tensor

Example

>>> length=torch.Tensor([1,2,3])
>>> mask=length_to_mask(length)
>>> mask
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
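The mask above can be built with a broadcasted comparison; a sketch under the assumption that row i is 1 for positions below length[i] (the helper name is illustrative, not the actual implementation):

```python
import torch

def length_to_mask_sketch(length, max_len=None):
    # Row i is True for positions 0..length[i]-1 and False afterwards.
    if max_len is None:
        max_len = int(length.max().item())
    return torch.arange(max_len)[None, :] < length[:, None]

mask = length_to_mask_sketch(torch.tensor([1, 2, 3]))
print(mask.float())
```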
speechbrain.dataio.dataio.read_kaldi_lab(kaldi_ali, kaldi_lab_opts)[source]

Read labels in kaldi format.

Uses kaldi IO.

Parameters
  • kaldi_ali (str) – Path to directory where kaldi alignments are stored.

  • kaldi_lab_opts (str) – A string that contains the options for reading the kaldi alignments.

Returns

lab – A dictionary containing the labels.

Return type

dict

Note

This depends on kaldi-io-for-python. Install it separately. See: https://github.com/vesis84/kaldi-io-for-python

Example

This example requires kaldi files.

lab_folder = '/home/kaldi/egs/TIMIT/s5/exp/dnn4_pretrain-dbn_dnn_ali'
read_kaldi_lab(lab_folder, 'ali-to-pdf')

speechbrain.dataio.dataio.get_md5(file)[source]

Get the md5 checksum of an input file.

Parameters

file (str) – Path to file for which compute the checksum.

Returns

Checksum for the given filepath.

Return type

str

Example

>>> get_md5('samples/audio_samples/example1.wav')
'c482d0081ca35302d30d12f1136c34e5'
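get_md5 presumably hashes the file bytes with hashlib.md5; a self-contained stdlib sketch that reads in chunks (the chunking is an implementation choice here, not necessarily SpeechBrain's):

```python
import hashlib
import os
import tempfile

def get_md5_sketch(path, chunk_size=65536):
    # Stream the file through hashlib.md5 and return the hex digest.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

tmp = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(tmp, "wb") as f:
    _ = f.write(b"hello")

print(get_md5_sketch(tmp))  # 5d41402abc4b2a76b9719d911017c592
```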
speechbrain.dataio.dataio.save_md5(files, out_file)[source]

Saves the md5 of a list of input files as a pickled dict into a file.

Parameters
  • files (list) – List of input files from which we will compute the md5.

  • out_file (str) – The path where to store the output pkl file.

Return type

None

Example

>>> files = ['samples/audio_samples/example1.wav']
>>> tmpdir = getfixture('tmpdir')
>>> save_md5(files, os.path.join(tmpdir, "md5.pkl"))

speechbrain.dataio.dataio.save_pkl(obj, file)[source]

Save an object in pkl format.

Parameters
  • obj (object) – Object to save in pkl format.

  • file (str) – Path to the output file.

Example

>>> tmpfile = os.path.join(getfixture('tmpdir'), "example.pkl")
>>> save_pkl([1, 2, 3, 4, 5], tmpfile)
>>> load_pkl(tmpfile)
[1, 2, 3, 4, 5]
speechbrain.dataio.dataio.load_pkl(file)[source]

Loads a pkl file.

For an example, see save_pkl.

Parameters

file (str) – Path to the input pkl file.

Returns

The loaded object.

Return type

object

speechbrain.dataio.dataio.prepend_bos_token(label, bos_index)[source]

Create labels with <bos> token at the beginning.

Parameters
  • label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length].

  • bos_index (int) – The index for <bos> token.

Returns

new_label – The new label with <bos> at the beginning.

Return type

tensor

Example

>>> label=torch.LongTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> new_label=prepend_bos_token(label, bos_index=7)
>>> new_label
tensor([[7, 1, 0, 0],
        [7, 2, 3, 0],
        [7, 4, 5, 6]])
speechbrain.dataio.dataio.append_eos_token(label, length, eos_index)[source]

Create labels with <eos> token appended.

Parameters
  • label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length]

  • length (torch.LongTensor) – Containing the original length of each label sequences. Must be 1D.

  • eos_index (int) – The index for <eos> token.

Returns

new_label – The new label with <eos> appended.

Return type

tensor

Example

>>> label=torch.IntTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> length=torch.LongTensor([1,2,3])
>>> new_label=append_eos_token(label, length, eos_index=7)
>>> new_label
tensor([[1, 7, 0, 0],
        [2, 3, 7, 0],
        [4, 5, 6, 7]], dtype=torch.int32)
speechbrain.dataio.dataio.merge_char(sequences, space='_')[source]

Merge character sequences into word sequences.

Parameters
  • sequences (list) – Each item is a list containing a character sequence.

  • space (string) – The token that represents a space. Default: '_'.

Returns

A list of word sequences, one per sentence.

Return type

list

Example

>>> sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
>>> results = merge_char(sequences)
>>> results
[['ab', 'c', 'de'], ['efg', 'hi']]
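The merge is essentially a join followed by a split on the space token; a stdlib sketch that reproduces the doctest (edge-case handling in the real function may differ):

```python
def merge_char_sketch(sequences, space="_"):
    # Join each character sequence into one string, then split on the
    # space token to recover the words.
    return ["".join(seq).split(space) for seq in sequences]

sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
print(merge_char_sketch(sequences))  # [['ab', 'c', 'de'], ['efg', 'hi']]
```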
speechbrain.dataio.dataio.merge_csvs(data_folder, csv_lst, merged_csv)[source]

Merge several CSV files into one.

Parameters
  • data_folder (string) – The folder that contains the CSV files to merge, and where the merged CSV is written.

  • csv_lst (list) – Filenames of the CSV files to be merged.

  • merged_csv (string) – The filename of the merged CSV file to write.

Example

>>> merge_csvs("samples/audio_samples/",
... ["csv_example.csv", "csv_example2.csv"],
... "test_csv_merge.csv")
speechbrain.dataio.dataio.split_word(sequences, space='_')[source]

Split word sequences into character sequences.

Parameters
  • sequences (list) – Each item is a list containing a word sequence.

  • space (string) – The token that represents a space. Default: '_'.

Returns

A list of character sequences, one per sentence.

Return type

list

Example

>>> sequences = [['ab', 'c', 'de'], ['efg', 'hi']]
>>> results = split_word(sequences)
>>> results
[['a', 'b', '_', 'c', '_', 'd', 'e'], ['e', 'f', 'g', '_', 'h', 'i']]