speechbrain.dataio.dataio module

Data reading and writing.

Authors
  • Mirco Ravanelli 2020

  • Aku Rouhe 2020

  • Ju-Chieh Chou 2020

  • Samuele Cornell 2020

  • Abdel HEBA 2020

  • Gaelle Laperriere 2021

  • Sahar Ghannay 2021

  • Sylvain de Langen 2022

Summary

Classes:

IterativeCSVWriter

Write CSV files a line at a time.

Functions:

append_eos_token

Create labels with <eos> token appended.

convert_index_to_lab

Convert a batch of integer IDs to string labels.

extract_concepts_values

Keep the semantic concepts and values for evaluation.

get_md5

Get the md5 checksum of an input file.

length_to_mask

Creates a binary mask for each sequence.

load_data_csv

Loads CSV and formats string values.

load_data_json

Loads JSON and recursively formats string values.

load_pickle

Utility function for loading .pkl pickle files.

load_pkl

Loads a pkl file.

merge_char

Merge character sequences into word sequences.

merge_csvs

Merge several CSV files into one file.

prepend_bos_token

Create labels with <bos> token at the beginning.

read_audio

General audio loading, based on a custom notation.

read_audio_info

Retrieves audio metadata from a file path.

read_audio_multichannel

General audio loading, based on a custom notation.

read_kaldi_lab

Read labels in kaldi format.

relative_time_to_absolute

Converts SpeechBrain style relative length to the absolute duration.

save_md5

Saves the md5 of a list of input files as a pickled dict into a file.

save_pkl

Save an object in pkl format.

split_word

Split word sequences into character sequences.

to_doubleTensor

Convert input data to a torch double tensor.

to_floatTensor

Convert input data to a torch float tensor.

to_longTensor

Convert input data to a torch long tensor.

write_audio

Write audio on disk.

write_stdout

Write data to standard output.

write_txt_file

Write data in text format.

Reference

speechbrain.dataio.dataio.load_data_json(json_path, replacements={})[source]

Loads JSON and recursively formats string values.

Parameters:
  • json_path (str) – Path to the JSON file.

  • replacements (dict) – (Optional dict), e.g., {“data_folder”: “/home/speechbrain/data”}. This is used to recursively format all string values in the data.

Returns:

JSON data with replacements applied.

Return type:

dict

Example

>>> json_spec = '''{
...   "ex1": {"files": ["{ROOT}/mic1/ex1.wav", "{ROOT}/mic2/ex1.wav"], "id": 1},
...   "ex2": {"files": [{"spk1": "{ROOT}/ex2.wav"}, {"spk2": "{ROOT}/ex2.wav"}], "id": 2}
... }
... '''
>>> tmpfile = getfixture('tmpdir') / "test.json"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(json_spec)
>>> data = load_data_json(tmpfile, {"ROOT": "/home"})
>>> data["ex1"]["files"][0]
'/home/mic1/ex1.wav'
>>> data["ex2"]["files"][1]["spk2"]
'/home/ex2.wav'
speechbrain.dataio.dataio.load_data_csv(csv_path, replacements={})[source]

Loads CSV and formats string values.

Uses the SpeechBrain legacy CSV data format, where the CSV must have an ‘ID’ field. If there is a field called duration, it is interpreted as a float. The rest of the fields are left as they are (legacy _format and _opts fields are not used to load the data in any special way).

Bash-like string replacements with $to_replace are supported.

Parameters:
  • csv_path (str) – Path to CSV file.

  • replacements (dict) – (Optional dict), e.g., {“data_folder”: “/home/speechbrain/data”} This is used to recursively format all string values in the data.

Returns:

CSV data with replacements applied.

Return type:

dict

Example

>>> csv_spec = '''ID,duration,wav_path
... utt1,1.45,$data_folder/utt1.wav
... utt2,2.0,$data_folder/utt2.wav
... '''
>>> tmpfile = getfixture("tmpdir") / "test.csv"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(csv_spec)
>>> data = load_data_csv(tmpfile, {"data_folder": "/home"})
>>> data["utt1"]["wav_path"]
'/home/utt1.wav'
speechbrain.dataio.dataio.read_audio_info(path) AudioMetaData[source]

Retrieves audio metadata from a file path. Behaves identically to torchaudio.info, but attempts to fix metadata (such as frame count) that is otherwise broken with certain torchaudio version and codec combinations.

Note that this may cause full file traversal in certain cases!

Parameters:

path (str) – Path to the audio file to examine.

Returns:

Same value as returned by torchaudio.info, but num_frames may be corrected if it would otherwise have been 0.

Return type:

torchaudio.backend.common.AudioMetaData

Note

Some codecs, such as MP3, require full file traversal for accurate length information to be retrieved. In these cases, you may as well read the entire audio file to avoid doubling the processing time.
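
Illustrative sketch (not part of the original docstring): assuming a 16 kHz mono WAV written with write_audio, the returned metadata exposes torchaudio's AudioMetaData fields such as sample_rate and num_frames.

>>> tmpfile = str(getfixture('tmpdir') / "info.wav")
>>> write_audio(tmpfile, torch.rand(16000), 16000)
>>> info = read_audio_info(tmpfile)
>>> info.sample_rate
16000
>>> info.num_frames
16000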

speechbrain.dataio.dataio.read_audio(waveforms_obj)[source]

General audio loading, based on a custom notation.

Expected use case is in conjunction with Datasets specified by JSON.

The parameter may just be a path to a file: read_audio(“/path/to/wav1.wav”)

Alternatively, you can specify more options in a dict, e.g.:

    # load a file from sample 8000 through 15999
    read_audio({
        "file": "/path/to/wav2.wav",
        "start": 8000,
        "stop": 16000
    })

Which codecs are supported depends on your torchaudio backend. Refer to torchaudio.load documentation for further details.

Parameters:

waveforms_obj (str, dict) – Path to audio or dict with the desired configuration.

Keys for the dict variant:

  • "file" (str): Path to the audio file.

  • "start" (int, optional): The first sample to load. If unspecified, load from the very first frame.

  • "stop" (int, optional): The last sample to load (exclusive). If unspecified or equal to start, load from start to the end. Will not fail if stop is past the sample count of the file; fewer frames are returned in that case.

Returns:

1-channel: audio tensor with shape (samples,). >=2-channels: audio tensor with shape (samples, channels).

Return type:

torch.Tensor

Example

>>> dummywav = torch.rand(16000)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0),atol=1e-4) # replace with eq with sox_io backend
True
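
As an illustrative follow-up (not part of the original docstring), the dict variant described above can be exercised on the same file, reusing tmpfile and loaded from the preceding example:

>>> segment = read_audio({"file": tmpfile, "start": 8000, "stop": 16000})
>>> segment.shape
torch.Size([8000])
>>> segment.allclose(loaded[8000:], atol=1e-4)
True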
speechbrain.dataio.dataio.read_audio_multichannel(waveforms_obj)[source]

General audio loading, based on a custom notation.

Expected use case is in conjunction with Datasets specified by JSON.

The custom notation:

The annotation can be just a path to a file: "/path/to/wav1.wav"

Multiple (possibly multi-channel) files can be specified, as long as they have the same length:

    {"files": [
        "/path/to/wav1.wav",
        "/path/to/wav2.wav"
    ]}

Or you can specify a single file more succinctly: {"files": "/path/to/wav2.wav"}

A start sample and a stop sample can also be specified to read only a segment within the files:

    {"files": [
        "/path/to/wav1.wav",
        "/path/to/wav2.wav"
    ],
     "start": 8000,
     "stop": 16000
    }

Parameters:

waveforms_obj (str, dict) – Audio reading annotation, see above for format.

Returns:

Audio tensor with shape: (samples, ).

Return type:

torch.Tensor

Example

>>> dummywav = torch.rand(16000, 2)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0),atol=1e-4) # replace with eq with sox_io backend
True
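
As an illustrative follow-up (not part of the original docstring), the multi-file notation described above can be exercised by listing the same stereo file twice, reusing tmpfile from the preceding example; under the assumption that channels from all files are stacked along the last dimension, the first dimension stays the number of samples:

>>> multi = read_audio_multichannel({"files": [tmpfile, tmpfile], "start": 0, "stop": 16000})
>>> multi.shape[0]
16000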
speechbrain.dataio.dataio.write_audio(filepath, audio, samplerate)[source]

Write audio on disk. It is essentially a wrapper that supports saving audio signals in the SpeechBrain format (samples, channels).

Parameters:
  • filepath (path) – Path where to save the audio file.

  • audio (torch.Tensor) – Audio signal in the expected SpeechBrain format (samples, channels).

  • samplerate (int) – Sample rate (e.g., 16000).

Example

>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> dummywav = torch.rand(16000, 2)
>>> write_audio(tmpfile, dummywav, 16000)
>>> loaded = read_audio(tmpfile)
>>> loaded.allclose(dummywav,atol=1e-4) # replace with eq with sox_io backend
True
speechbrain.dataio.dataio.load_pickle(pickle_path)[source]

Utility function for loading .pkl pickle files.

Parameters:

pickle_path (str) – Path to pickle file.

Returns:

out – Python object loaded from pickle.

Return type:

object
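
A minimal usage sketch (not part of the original docstring), assuming the file was written with save_pkl, which uses the same pickle format:

>>> tmpfile = getfixture('tmpdir') / "obj.pkl"
>>> save_pkl({"a": 1}, tmpfile)
>>> load_pickle(tmpfile)
{'a': 1}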

speechbrain.dataio.dataio.to_floatTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Parameters:

x ((list, tuple, np.ndarray)) – Input data to be converted to torch float.

Returns:

tensor – Data now in torch.tensor float datatype.

Return type:

torch.tensor

speechbrain.dataio.dataio.to_doubleTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Parameters:

x ((list, tuple, np.ndarray)) – Input data to be converted to torch double.

Returns:

tensor – Data now in torch.tensor double datatype.

Return type:

torch.tensor

speechbrain.dataio.dataio.to_longTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Parameters:

x ((list, tuple, np.ndarray)) – Input data to be converted to torch long.

Returns:

tensor – Data now in torch.tensor long datatype.

Return type:

torch.tensor
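
An illustrative sketch (not part of the original docstrings) applying the three converters to the same Python list:

>>> x = [1, 2, 3]
>>> to_floatTensor(x).dtype
torch.float32
>>> to_doubleTensor(x).dtype
torch.float64
>>> to_longTensor(x).dtype
torch.int64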

speechbrain.dataio.dataio.convert_index_to_lab(batch, ind2lab)[source]

Convert a batch of integer IDs to string labels.

Parameters:
  • batch (list) – List of lists, a batch of sequences.

  • ind2lab (dict) – Mapping from integer IDs to labels.

Returns:

List of lists, same size as batch, with labels from ind2lab.

Return type:

list

Example

>>> ind2lab = {1: "h", 2: "e", 3: "l", 4: "o"}
>>> out = convert_index_to_lab([[4,1], [1,2,3,3,4]], ind2lab)
>>> for seq in out:
...     print("".join(seq))
oh
hello
speechbrain.dataio.dataio.relative_time_to_absolute(batch, relative_lens, rate)[source]

Converts SpeechBrain style relative length to the absolute duration.

Operates on batch level.

Parameters:
  • batch (torch.tensor) – Sequences to determine the duration for.

  • relative_lens (torch.tensor) – The relative length of each sequence in batch. The longest sequence in the batch needs to have relative length 1.0.

  • rate (float) – The rate at which sequence elements occur in real-world time. Sample rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is features. This has to have 1/s as the unit.

Returns:

Duration of each sequence in seconds.

Return type:

torch.tensor

Example

>>> batch = torch.ones(2, 16000)
>>> relative_lens = torch.tensor([3./4., 1.0])
>>> rate = 16000
>>> print(relative_time_to_absolute(batch, relative_lens, rate))
tensor([0.7500, 1.0000])
class speechbrain.dataio.dataio.IterativeCSVWriter(outstream, data_fields, defaults={})[source]

Bases: object

Write CSV files a line at a time.

Parameters:
  • outstream (file-object) – A writeable stream

  • data_fields (list) – List of the optional keys to write. Each key will be expanded to the SpeechBrain format, producing three fields: key, key_format, key_opts.

Example

>>> import io
>>> f = io.StringIO()
>>> writer = IterativeCSVWriter(f, ["phn"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
>>> writer.write("UTT1",2.5,"sil hh ee ll ll oo sil","string","")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
>>> writer.write(ID="UTT2",phn="sil ww oo rr ll dd sil",phn_format="string")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
>>> writer.set_default('phn_format', 'string')
>>> writer.write_batch(ID=["UTT3","UTT4"],phn=["ff oo oo", "bb aa rr"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
UTT3,,ff oo oo,string,
UTT4,,bb aa rr,string,
set_default(field, value)[source]

Sets a default value for the given CSV field.

Parameters:
  • field (str) – A field in the CSV.

  • value – The default value.

write(*args, **kwargs)[source]

Writes one data line into the CSV.

Parameters:
  • *args – Supply a value for every field in positional order, or

  • **kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.

write_batch(*args, **kwargs)[source]

Writes a batch of lines into the CSV.

Here each argument should be a list with the same length.

Parameters:
  • *args – Supply a value for every field in positional order, or

  • **kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.

speechbrain.dataio.dataio.write_txt_file(data, filename, sampling_rate=None)[source]

Write data in text format.

Parameters:
  • data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.

  • filename (str) – Path to file where to write the data.

  • sampling_rate (None) – Not used, just here for interface compatibility.

Return type:

None

Example

>>> tmpdir = getfixture('tmpdir')
>>> signal=torch.tensor([1,2,3,4])
>>> write_txt_file(signal, tmpdir / 'example.txt')
speechbrain.dataio.dataio.write_stdout(data, filename=None, sampling_rate=None)[source]

Write data to standard output.

Parameters:
  • data (str, list, torch.tensor, numpy.ndarray) – The data to write to standard output.

  • filename (None) – Not used, just here for compatibility.

  • sampling_rate (None) – Not used, just here for compatibility.

Return type:

None

Example

>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([[1,2,3,4]])
>>> write_stdout(signal, tmpdir / 'example.txt')
[1, 2, 3, 4]
speechbrain.dataio.dataio.length_to_mask(length, max_len=None, dtype=None, device=None)[source]

Creates a binary mask for each sequence.

Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3

Parameters:
  • length (torch.LongTensor) – Containing the length of each sequence in the batch. Must be 1D.

  • max_len (int) – Max length for the mask, also the size of the second dimension.

  • dtype (torch.dtype, default: None) – The dtype of the generated mask.

  • device (torch.device, default: None) – The device to put the mask variable.

Returns:

mask – The binary mask.

Return type:

tensor

Example

>>> length=torch.Tensor([1,2,3])
>>> mask=length_to_mask(length)
>>> mask
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
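
An additional illustrative sketch (not part of the original docstring) showing the max_len and dtype arguments, which pad the mask to a fixed width and select the output dtype:

>>> mask = length_to_mask(torch.tensor([1, 2, 3]), max_len=5, dtype=torch.bool)
>>> mask.shape
torch.Size([3, 5])
>>> mask.dtype
torch.bool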
speechbrain.dataio.dataio.read_kaldi_lab(kaldi_ali, kaldi_lab_opts)[source]

Read labels in kaldi format.

Uses kaldi IO.

Parameters:
  • kaldi_ali (str) – Path to directory where kaldi alignments are stored.

  • kaldi_lab_opts (str) – A string that contains the options for reading the kaldi alignments.

Returns:

lab – A dictionary containing the labels.

Return type:

dict

Note

This depends on kaldi-io-for-python. Install it separately. See: https://github.com/vesis84/kaldi-io-for-python

Example

This example requires kaldi files.

    lab_folder = '/home/kaldi/egs/TIMIT/s5/exp/dnn4_pretrain-dbn_dnn_ali'
    read_kaldi_lab(lab_folder, 'ali-to-pdf')

speechbrain.dataio.dataio.get_md5(file)[source]

Get the md5 checksum of an input file.

Parameters:

file (str) – Path to the file for which to compute the checksum.

Returns:

Checksum for the given filepath.

Return type:

str

Example

>>> get_md5('tests/samples/single-mic/example1.wav')
'c482d0081ca35302d30d12f1136c34e5'
speechbrain.dataio.dataio.save_md5(files, out_file)[source]

Saves the md5 of a list of input files as a pickled dict into a file.

Parameters:
  • files (list) – List of input files from which we will compute the md5.

  • out_file (str) – The path where to store the output pkl file.

Return type:

None

Example

>>> files = ['tests/samples/single-mic/example1.wav']
>>> tmpdir = getfixture('tmpdir')
>>> save_md5(files, tmpdir / "md5.pkl")
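
As a follow-up sketch (not part of the original docstring), and assuming the pickled dict is keyed by input file path, the checksums can be read back with load_pkl:

>>> md5s = load_pkl(tmpdir / "md5.pkl")
>>> md5s['tests/samples/single-mic/example1.wav']
'c482d0081ca35302d30d12f1136c34e5'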

speechbrain.dataio.dataio.save_pkl(obj, file)[source]

Save an object in pkl format.

Parameters:
  • obj (object) – Object to save in pkl format.

  • file (str) – Path to the output file.

Example

>>> tmpfile = getfixture('tmpdir') / "example.pkl"
>>> save_pkl([1, 2, 3, 4, 5], tmpfile)
>>> load_pkl(tmpfile)
[1, 2, 3, 4, 5]
speechbrain.dataio.dataio.load_pkl(file)[source]

Loads a pkl file.

For an example, see save_pkl.

Parameters:

file (str) – Path to the input pkl file.

Return type:

The loaded object.

speechbrain.dataio.dataio.prepend_bos_token(label, bos_index)[source]

Create labels with <bos> token at the beginning.

Parameters:
  • label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length].

  • bos_index (int) – The index for <bos> token.

Returns:

new_label – The new label with <bos> at the beginning.

Return type:

tensor

Example

>>> label=torch.LongTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> new_label=prepend_bos_token(label, bos_index=7)
>>> new_label
tensor([[7, 1, 0, 0],
        [7, 2, 3, 0],
        [7, 4, 5, 6]])
speechbrain.dataio.dataio.append_eos_token(label, length, eos_index)[source]

Create labels with <eos> token appended.

Parameters:
  • label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length]

  • length (torch.LongTensor) – Containing the original length of each label sequence. Must be 1D.

  • eos_index (int) – The index for <eos> token.

Returns:

new_label – The new label with <eos> appended.

Return type:

tensor

Example

>>> label=torch.IntTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> length=torch.LongTensor([1,2,3])
>>> new_label=append_eos_token(label, length, eos_index=7)
>>> new_label
tensor([[1, 7, 0, 0],
        [2, 3, 7, 0],
        [4, 5, 6, 7]], dtype=torch.int32)
speechbrain.dataio.dataio.merge_char(sequences, space='_')[source]

Merge character sequences into word sequences.

Parameters:
  • sequences (list) – Each item contains a list, and this list contains a character sequence.

  • space (string) – The token that represents a space. Default: '_'.

Return type:

A list of word sequences, one per sentence.

Example

>>> sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
>>> results = merge_char(sequences)
>>> results
[['ab', 'c', 'de'], ['efg', 'hi']]
speechbrain.dataio.dataio.merge_csvs(data_folder, csv_lst, merged_csv)[source]

Merge several CSV files into one file.

Parameters:
  • data_folder (string) – The folder that contains the CSV files to be merged; the merged file is written to this same folder.

  • csv_lst (list) – Filenames of the CSV files to be merged.

  • merged_csv (string) – The filename of the merged output CSV file.

Example

>>> tmpdir = getfixture('tmpdir')
>>> os.symlink(os.path.realpath("tests/samples/annotation/speech.csv"), tmpdir / "speech.csv")
>>> merge_csvs(tmpdir,
... ["speech.csv", "speech.csv"],
... "test_csv_merge.csv")
speechbrain.dataio.dataio.split_word(sequences, space='_')[source]

Split word sequences into character sequences.

Parameters:
  • sequences (list) – Each item contains a list, and this list contains a word sequence.

  • space (string) – The token that represents a space. Default: '_'.

Return type:

A list of character sequences, one per sentence.

Example

>>> sequences = [['ab', 'c', 'de'], ['efg', 'hi']]
>>> results = split_word(sequences)
>>> results
[['a', 'b', '_', 'c', '_', 'd', 'e'], ['e', 'f', 'g', '_', 'h', 'i']]
speechbrain.dataio.dataio.extract_concepts_values(sequences, keep_values, tag_in, tag_out, space)[source]

Keep the semantic concepts and values for evaluation.

Parameters:
  • sequences (list) – Each item contains a list, and this list contains a character sequence.

  • keep_values (bool) – If True, keep the values; otherwise, discard them.

  • tag_in (char) – Indicates the start of the concept.

  • tag_out (char) – Indicates the end of the concept.

  • space (string) – The token that represents a space. Default: '_'.

Return type:

The list contains concept and value sequences for each sentence.

Example

>>> sequences = [['<reponse>','_','n','o','_','>','_','<localisation-ville>','_','L','e','_','M','a','n','s','_','>'], ['<reponse>','_','s','i','_','>'],['v','a','_','b','e','n','e']]
>>> results = extract_concepts_values(sequences, True, '<', '>', '_')
>>> results
[['<reponse> no', '<localisation-ville> Le Mans'], ['<reponse> si'], ['']]