speechbrain.dataio.dataio module

Data reading and writing.

Authors
  • Mirco Ravanelli 2020

  • Aku Rouhe 2020

  • Ju-Chieh Chou 2020

  • Samuele Cornell 2020

  • Abdel HEBA 2020

  • Sylvain de Langen 2022

Summary

Classes:

IterativeCSVWriter

Write CSV files a line at a time.

Functions:

append_eos_token

Create labels with <eos> token appended.

convert_index_to_lab

Convert a batch of integer IDs to string labels.

get_md5

Get the md5 checksum of an input file.

length_to_mask

Creates a binary mask for each sequence.

load_data_csv

Loads CSV and formats string values.

load_data_json

Loads JSON and recursively formats string values.

load_pickle

Utility function for loading .pkl pickle files.

load_pkl

Loads a pkl file.

merge_char

Merge character sequences into word sequences.

merge_csvs

Merge several csv files into one file.

prepend_bos_token

Create labels with <bos> token at the beginning.

read_audio

General audio loading, based on a custom notation.

read_audio_multichannel

General audio loading, based on a custom notation.

read_kaldi_lab

Read labels in kaldi format.

relative_time_to_absolute

Converts SpeechBrain style relative length to the absolute duration.

save_md5

Saves the md5 of a list of input files as a pickled dict into a file.

save_pkl

Save an object in pkl format.

split_word

Split word sequences into character sequences.

to_doubleTensor

Convert input data to a torch double tensor.

to_floatTensor

Convert input data to a torch float tensor.

to_longTensor

Convert input data to a torch long tensor.

write_audio

Write audio on disk.

write_stdout

Write data to standard output.

write_txt_file

Write data in text format.

Reference

speechbrain.dataio.dataio.load_data_json(json_path, replacements={})[source]

Loads JSON and recursively formats string values.

Parameters
  • json_path (str) – Path to JSON file.

  • replacements (dict) – (Optional dict), e.g., {“data_folder”: “/home/speechbrain/data”}. This is used to recursively format all string values in the data.

Returns

JSON data with replacements applied.

Return type

dict

Example

>>> json_spec = '''{
...   "ex1": {"files": ["{ROOT}/mic1/ex1.wav", "{ROOT}/mic2/ex1.wav"], "id": 1},
...   "ex2": {"files": [{"spk1": "{ROOT}/ex2.wav"}, {"spk2": "{ROOT}/ex2.wav"}], "id": 2}
... }
... '''
>>> tmpfile = getfixture('tmpdir') / "test.json"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(json_spec)
>>> data = load_data_json(tmpfile, {"ROOT": "/home"})
>>> data["ex1"]["files"][0]
'/home/mic1/ex1.wav'
>>> data["ex2"]["files"][1]["spk2"]
'/home/ex2.wav'
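The recursive string formatting can be sketched in plain Python (an illustrative re-implementation, not SpeechBrain's actual code; the real function also reads the JSON file from disk):

```python
# Illustrative sketch of the recursive formatting applied to every string
# value; not SpeechBrain's actual code, which also reads the JSON file.
def format_strings(obj, replacements):
    if isinstance(obj, str):
        return obj.format_map(replacements)
    if isinstance(obj, dict):
        return {k: format_strings(v, replacements) for k, v in obj.items()}
    if isinstance(obj, list):
        return [format_strings(v, replacements) for v in obj]
    return obj

data = {"ex1": {"files": ["{ROOT}/mic1/ex1.wav"], "id": 1}}
print(format_strings(data, {"ROOT": "/home"})["ex1"]["files"][0])
# /home/mic1/ex1.wav
```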
speechbrain.dataio.dataio.load_data_csv(csv_path, replacements={})[source]

Loads CSV and formats string values.

Uses the SpeechBrain legacy CSV data format, where the CSV must have an ‘ID’ field. If there is a field called duration, it is interpreted as a float. The rest of the fields are left as they are (legacy _format and _opts fields are not used to load the data in any special way).

Bash-like string replacements with $to_replace are supported.

Parameters
  • csv_path (str) – Path to CSV file.

  • replacements (dict) – (Optional dict), e.g., {“data_folder”: “/home/speechbrain/data”} This is used to recursively format all string values in the data.

Returns

CSV data with replacements applied.

Return type

dict

Example

>>> csv_spec = '''ID,duration,wav_path
... utt1,1.45,$data_folder/utt1.wav
... utt2,2.0,$data_folder/utt2.wav
... '''
>>> tmpfile = getfixture("tmpdir") / "test.csv"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(csv_spec)
>>> data = load_data_csv(tmpfile, {"data_folder": "/home"})
>>> data["utt1"]["wav_path"]
'/home/utt1.wav'
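The Bash-like $key substitution can be sketched with the standard library (an illustrative re-implementation; SpeechBrain's actual code may differ, and the real function also parses the CSV itself):

```python
import re

# Illustrative sketch of the Bash-like $key substitution applied to each
# string field; not SpeechBrain's actual code, which also parses the CSV.
def apply_replacements(value, replacements):
    return re.sub(r"\$(\w+)", lambda m: str(replacements[m.group(1)]), value)

print(apply_replacements("$data_folder/utt1.wav", {"data_folder": "/home"}))
# /home/utt1.wav
```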
speechbrain.dataio.dataio.read_audio(waveforms_obj)[source]

General audio loading, based on a custom notation.

Expected use case is in conjunction with Datasets specified by JSON.

The parameter may just be a path to a file: read_audio(“/path/to/wav1.wav”)

Alternatively, you can specify more options in a dict, e.g.:

```
# load a file from sample 8000 through 15999
read_audio({
    "file": "/path/to/wav2.wav",
    "start": 8000,
    "stop": 16000
})
```

Which codecs are supported depends on your torchaudio backend. Refer to torchaudio.load documentation for further details.

Parameters

waveforms_obj (str, dict) – Path to audio or dict with the desired configuration.

Keys for the dict variant:
  • “file” (str): Path to the audio file.

  • “start” (int, optional): The first sample to load. If unspecified, load from the very first frame.

  • “stop” (int, optional): The last sample to load (exclusive). If unspecified or equal to start, load from start to the end. Will not fail if stop is past the sample count of the file; fewer frames are simply returned.

Returns

1-channel: audio tensor with shape (samples, ). >=2-channels: audio tensor with shape (samples, channels).

Return type

torch.Tensor

Example

>>> dummywav = torch.rand(16000)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0),atol=1e-4) # replace with eq with sox_io backend
True
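The “start”/“stop” keys of the dict variant select a half-open sample range, equivalent to tensor slicing (the real function presumably delegates the seek to the torchaudio backend). A minimal sketch:

```python
import torch

# The "start"/"stop" keys select a half-open sample range, like tensor
# slicing; the real function presumably seeks via the torchaudio backend.
signal = torch.rand(16000)
start, stop = 8000, 16000
segment = signal[start:stop]  # "stop" is exclusive
print(segment.shape)  # torch.Size([8000])
```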
speechbrain.dataio.dataio.read_audio_multichannel(waveforms_obj)[source]

General audio loading, based on a custom notation.

Expected use case is in conjunction with Datasets specified by JSON.

The custom notation:

The annotation can be just a path to a file: “/path/to/wav1.wav”

Multiple (possibly multi-channel) files can be specified, as long as they have the same length:

```
{"files": [
    "/path/to/wav1.wav",
    "/path/to/wav2.wav"
]}
```

Or you can specify a single file more succinctly: {“files”: “/path/to/wav2.wav”}

start and stop sample offsets can also be specified to read only a segment within the files:

```
{"files": [
    "/path/to/wav1.wav",
    "/path/to/wav2.wav"
],
 "start": 8000,
 "stop": 16000}
```

Parameters

waveforms_obj (str, dict) – Audio reading annotation, see above for format.

Returns

Audio tensor with shape: (samples, ).

Return type

torch.Tensor

Example

>>> dummywav = torch.rand(16000, 2)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0),atol=1e-4) # replace with eq with sox_io backend
True
speechbrain.dataio.dataio.write_audio(filepath, audio, samplerate)[source]

Write audio on disk. It is essentially a wrapper that saves audio signals in the SpeechBrain format (signal, channels).

Parameters
  • filepath (path) – Path where to save the audio file.

  • audio (torch.Tensor) – Audio file in the expected speechbrain format (signal, channels).

  • samplerate (int) – Sample rate (e.g., 16000).

Example

>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> dummywav = torch.rand(16000, 2)
>>> write_audio(tmpfile, dummywav, 16000)
>>> loaded = read_audio(tmpfile)
>>> loaded.allclose(dummywav,atol=1e-4) # replace with eq with sox_io backend
True
speechbrain.dataio.dataio.load_pickle(pickle_path)[source]

Utility function for loading .pkl pickle files.

Parameters

pickle_path (str) – Path to pickle file.

Returns

out – Python object loaded from pickle.

Return type

object

speechbrain.dataio.dataio.to_floatTensor(x: (list, tuple, np.ndarray))[source]

Convert input data to a torch float tensor.

Parameters

x ((list, tuple, np.ndarray)) – Input data to be converted to torch float.

Returns

tensor – Data now in torch.tensor float datatype.

Return type

torch.tensor

speechbrain.dataio.dataio.to_doubleTensor(x: (list, tuple, np.ndarray))[source]

Convert input data to a torch double tensor.

Parameters

x ((list, tuple, np.ndarray)) – Input data to be converted to torch double.

Returns

tensor – Data now in torch.tensor double datatype.

Return type

torch.tensor

speechbrain.dataio.dataio.to_longTensor(x: (list, tuple, np.ndarray))[source]

Convert input data to a torch long tensor.

Parameters

x ((list, tuple, np.ndarray)) – Input data to be converted to torch long.

Returns

tensor – Data now in torch.tensor long datatype.

Return type

torch.tensor
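These three converters presumably reduce to torch.as_tensor plus a dtype cast; the following equivalence is an assumption, not the library's code:

```python
import numpy as np
import torch

# These converters presumably reduce to torch.as_tensor plus a dtype
# cast; this equivalence is an assumption, not the library's code.
x = np.array([1, 2, 3])
print(torch.as_tensor(x).float().dtype)   # torch.float32
print(torch.as_tensor(x).double().dtype)  # torch.float64
print(torch.as_tensor(x).long().dtype)    # torch.int64
```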

speechbrain.dataio.dataio.convert_index_to_lab(batch, ind2lab)[source]

Convert a batch of integer IDs to string labels.

Parameters
  • batch (list) – List of lists, a batch of sequences.

  • ind2lab (dict) – Mapping from integer IDs to labels.

Returns

List of lists, same size as batch, with labels from ind2lab.

Return type

list

Example

>>> ind2lab = {1: "h", 2: "e", 3: "l", 4: "o"}
>>> out = convert_index_to_lab([[4,1], [1,2,3,3,4]], ind2lab)
>>> for seq in out:
...     print("".join(seq))
oh
hello
speechbrain.dataio.dataio.relative_time_to_absolute(batch, relative_lens, rate)[source]

Converts SpeechBrain style relative length to the absolute duration.

Operates on batch level.

Parameters
  • batch (torch.tensor) – Sequences to determine the duration for.

  • relative_lens (torch.tensor) – The relative length of each sequence in batch. The longest sequence in the batch needs to have relative length 1.0.

  • rate (float) – The rate at which sequence elements occur in real-world time. Sample rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is features. This has to have 1/s as the unit.

Returns

Duration of each sequence in seconds.

Return type

torch.tensor

Example

>>> batch = torch.ones(2, 16000)
>>> relative_lens = torch.tensor([3./4., 1.0])
>>> rate = 16000
>>> print(relative_time_to_absolute(batch, relative_lens, rate))
tensor([0.7500, 1.0000])
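The conversion itself is a one-liner: duration = relative_len * longest_length / rate. Reproducing the example above with plain tensor arithmetic (a sketch, not the library code):

```python
import torch

# Plain tensor arithmetic reproducing the example (a sketch, not the
# library code): duration = relative_len * longest_length / rate.
batch = torch.ones(2, 16000)
relative_lens = torch.tensor([3. / 4., 1.0])
rate = 16000.0
durations = relative_lens * batch.shape[1] / rate
print(durations)  # tensor([0.7500, 1.0000])
```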
class speechbrain.dataio.dataio.IterativeCSVWriter(outstream, data_fields, defaults={})[source]

Bases: object

Write CSV files a line at a time.

Parameters
  • outstream (file-object) – A writeable stream

  • data_fields (list) – List of the optional keys to write. Each key will be expanded to the SpeechBrain format, producing three fields: key, key_format, key_opts.

Example

>>> import io
>>> f = io.StringIO()
>>> writer = IterativeCSVWriter(f, ["phn"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
>>> writer.write("UTT1",2.5,"sil hh ee ll ll oo sil","string","")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
>>> writer.write(ID="UTT2",phn="sil ww oo rr ll dd sil",phn_format="string")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
>>> writer.set_default('phn_format', 'string')
>>> writer.write_batch(ID=["UTT3","UTT4"],phn=["ff oo oo", "bb aa rr"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
UTT3,,ff oo oo,string,
UTT4,,bb aa rr,string,
set_default(field, value)[source]

Sets a default value for the given CSV field.

Parameters
  • field (str) – A field in the CSV.

  • value – The default value.

write(*args, **kwargs)[source]

Writes one data line into the CSV.

Parameters
  • *args – Supply a value for every field in positional form, OR

  • **kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.

write_batch(*args, **kwargs)[source]

Writes a batch of lines into the CSV.

Here each argument should be a list with the same length.

Parameters
  • *args – Supply a value for every field in positional form, OR

  • **kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.

speechbrain.dataio.dataio.write_txt_file(data, filename, sampling_rate=None)[source]

Write data in text format.

Parameters
  • data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.

  • filename (str) – Path to file where to write the data.

  • sampling_rate (None) – Not used, just here for interface compatibility.

Return type

None

Example

>>> tmpdir = getfixture('tmpdir')
>>> signal=torch.tensor([1,2,3,4])
>>> write_txt_file(signal, tmpdir / 'example.txt')
speechbrain.dataio.dataio.write_stdout(data, filename=None, sampling_rate=None)[source]

Write data to standard output.

Parameters
  • data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.

  • filename (None) – Not used, just here for compatibility.

  • sampling_rate (None) – Not used, just here for compatibility.

Return type

None

Example

>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([[1,2,3,4]])
>>> write_stdout(signal, tmpdir / 'example.txt')
[1, 2, 3, 4]
speechbrain.dataio.dataio.length_to_mask(length, max_len=None, dtype=None, device=None)[source]

Creates a binary mask for each sequence.

Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3

Parameters
  • length (torch.LongTensor) – Containing the length of each sequence in the batch. Must be 1D.

  • max_len (int) – Max length for the mask, also the size of the second dimension.

  • dtype (torch.dtype, default: None) – The dtype of the generated mask.

  • device (torch.device, default: None) – The device to put the mask variable.

Returns

mask – The binary mask.

Return type

tensor

Example

>>> length=torch.Tensor([1,2,3])
>>> mask=length_to_mask(length)
>>> mask
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
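The mask construction follows the broadcasting trick from the linked thread: compare a [0, max_len) range against each length. A plain-PyTorch sketch:

```python
import torch

# The broadcasting trick behind length_to_mask (see the linked thread):
# compare positions [0, max_len) against each sequence length.
length = torch.tensor([1, 2, 3])
max_len = int(length.max())
mask = torch.arange(max_len)[None, :] < length[:, None]
print(mask.long())
```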
speechbrain.dataio.dataio.read_kaldi_lab(kaldi_ali, kaldi_lab_opts)[source]

Read labels in kaldi format.

Uses kaldi IO.

Parameters
  • kaldi_ali (str) – Path to directory where kaldi alignments are stored.

  • kaldi_lab_opts (str) – A string that contains the options for reading the kaldi alignments.

Returns

lab – A dictionary containing the labels.

Return type

dict

Note

This depends on kaldi-io-for-python. Install it separately. See: https://github.com/vesis84/kaldi-io-for-python

Example

This example requires kaldi files.

```
lab_folder = '/home/kaldi/egs/TIMIT/s5/exp/dnn4_pretrain-dbn_dnn_ali'
read_kaldi_lab(lab_folder, 'ali-to-pdf')
```

speechbrain.dataio.dataio.get_md5(file)[source]

Get the md5 checksum of an input file.

Parameters

file (str) – Path to file for which compute the checksum.

Returns

Checksum for the given filepath.

Return type

str

Example

>>> get_md5('tests/samples/single-mic/example1.wav')
'c482d0081ca35302d30d12f1136c34e5'
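An equivalent checksum can be computed with only the standard library; get_md5 presumably does the same (the chunk size and exact reading strategy here are assumptions):

```python
import hashlib

# Standard-library equivalent of an md5 file checksum; the chunk size
# and exact reading strategy are assumptions, not SpeechBrain's code.
def md5_of(path, chunk=65536):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()
```

Applied to a file, this should return the same hex digest as get_md5, assuming get_md5 is a plain md5 over the file bytes.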
speechbrain.dataio.dataio.save_md5(files, out_file)[source]

Saves the md5 of a list of input files as a pickled dict into a file.

Parameters
  • files (list) – List of input files from which we will compute the md5.

  • out_file (str) – The path where to store the output pkl file.

Returns

None

Example

>>> files = ['tests/samples/single-mic/example1.wav']
>>> tmpdir = getfixture('tmpdir')
>>> save_md5(files, tmpdir / "md5.pkl")

speechbrain.dataio.dataio.save_pkl(obj, file)[source]

Save an object in pkl format.

Parameters
  • obj (object) – Object to save in pkl format.

  • file (str) – Path to the output file.

Example

>>> tmpfile = getfixture('tmpdir') / "example.pkl"
>>> save_pkl([1, 2, 3, 4, 5], tmpfile)
>>> load_pkl(tmpfile)
[1, 2, 3, 4, 5]
speechbrain.dataio.dataio.load_pkl(file)[source]

Loads a pkl file.

For an example, see save_pkl.

Parameters

file (str) – Path to the input pkl file.

Returns

The loaded object.

Return type

object
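save_pkl and load_pkl presumably wrap the standard pickle module; an equivalent stdlib sketch:

```python
import os
import pickle
import tempfile

# Hedged sketch of what save_pkl/load_pkl presumably wrap:
path = os.path.join(tempfile.mkdtemp(), "example.pkl")
with open(path, "wb") as f:
    pickle.dump([1, 2, 3], f)       # save_pkl equivalent
with open(path, "rb") as f:
    loaded = pickle.load(f)         # load_pkl equivalent
print(loaded)  # [1, 2, 3]
```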

speechbrain.dataio.dataio.prepend_bos_token(label, bos_index)[source]

Create labels with <bos> token at the beginning.

Parameters
  • label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length].

  • bos_index (int) – The index for <bos> token.

Returns

new_label – The new label with <bos> at the beginning.

Return type

tensor

Example

>>> label=torch.LongTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> new_label=prepend_bos_token(label, bos_index=7)
>>> new_label
tensor([[7, 1, 0, 0],
        [7, 2, 3, 0],
        [7, 4, 5, 6]])
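The operation is a simple concatenation of a <bos> column in front of the batch; a plain-PyTorch sketch reproducing the example above (illustrative, not the library's code):

```python
import torch

# A sketch reproducing the example: build a column of <bos> indices and
# concatenate it in front of the label batch. Illustrative only.
label = torch.LongTensor([[1, 0, 0], [2, 3, 0], [4, 5, 6]])
bos_index = 7
bos = torch.full((label.shape[0], 1), bos_index, dtype=label.dtype)
new_label = torch.cat([bos, label], dim=1)
print(new_label)
```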
speechbrain.dataio.dataio.append_eos_token(label, length, eos_index)[source]

Create labels with <eos> token appended.

Parameters
  • label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length]

  • length (torch.LongTensor) – Containing the original length of each label sequences. Must be 1D.

  • eos_index (int) – The index for <eos> token.

Returns

new_label – The new label with <eos> appended.

Return type

tensor

Example

>>> label=torch.IntTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> length=torch.LongTensor([1,2,3])
>>> new_label=append_eos_token(label, length, eos_index=7)
>>> new_label
tensor([[1, 7, 0, 0],
        [2, 3, 7, 0],
        [4, 5, 6, 7]], dtype=torch.int32)
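Appending <eos> requires one extra column plus an indexed write at each sequence's true length; a plain-PyTorch sketch reproducing the example above (not the library's actual implementation):

```python
import torch
import torch.nn.functional as F

# A sketch reproducing the example above (not the library's actual code):
# pad one extra column of zeros, then write <eos> at each true length.
label = torch.IntTensor([[1, 0, 0], [2, 3, 0], [4, 5, 6]])
length = torch.LongTensor([1, 2, 3])
eos_index = 7
new_label = F.pad(label, (0, 1))  # (batch, max_length + 1)
new_label[torch.arange(label.shape[0]), length] = eos_index
print(new_label)
```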
speechbrain.dataio.dataio.merge_char(sequences, space='_')[source]

Merge character sequences into word sequences.

Parameters
  • sequences (list) – Each item contains a list, and this list contains a character sequence.

  • space (string) – The token that represents a space. Default: _

Return type

A list of word sequences, one per sentence.

Example

>>> sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
>>> results = merge_char(sequences)
>>> results
[['ab', 'c', 'de'], ['efg', 'hi']]
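The merge is equivalent to joining each character sequence into one string and splitting on the space token; a plain-Python sketch (illustrative, not the library's code):

```python
# Illustrative sketch of the merge (not the library's code): join each
# character sequence into one string, then split on the space token.
def merge_char_sketch(sequences, space="_"):
    return ["".join(seq).split(space) for seq in sequences]

print(merge_char_sketch([["a", "b", "_", "c", "_", "d", "e"]]))
# [['ab', 'c', 'de']]
```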
speechbrain.dataio.dataio.merge_csvs(data_folder, csv_lst, merged_csv)[source]

Merge several csv files into one file.

Parameters
  • data_folder (string) – The folder where the csv files to be merged are stored, and where the merged csv file is written.

  • csv_lst (list) – Filenames of the csv files to be merged.

  • merged_csv (string) – The filename to write the merged csv file.

Example

>>> tmpdir = getfixture('tmpdir')
>>> os.symlink(os.path.realpath("tests/samples/annotation/speech.csv"), tmpdir / "speech.csv")
>>> merge_csvs(tmpdir,
... ["speech.csv", "speech.csv"],
... "test_csv_merge.csv")
speechbrain.dataio.dataio.split_word(sequences, space='_')[source]

Split word sequences into character sequences.

Parameters
  • sequences (list) – Each item contains a list, and this list contains a word sequence.

  • space (string) – The token that represents a space. Default: _

Return type

A list of character sequences, one per sentence.

Example

>>> sequences = [['ab', 'c', 'de'], ['efg', 'hi']]
>>> results = split_word(sequences)
>>> results
[['a', 'b', '_', 'c', '_', 'd', 'e'], ['e', 'f', 'g', '_', 'h', 'i']]
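The inverse operation: join the words with the space token, then split the resulting string into single characters. A plain-Python sketch (illustrative only):

```python
# Illustrative inverse sketch: join words with the space token, then
# split the resulting string into single characters.
def split_word_sketch(sequences, space="_"):
    return [list(space.join(words)) for words in sequences]

print(split_word_sketch([["ab", "c", "de"]]))
# [['a', 'b', '_', 'c', '_', 'd', 'e']]
```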