speechbrain.dataio.dataio module¶
Data reading and writing.
- Authors
Mirco Ravanelli 2020
Aku Rouhe 2020
Ju-Chieh Chou 2020
Samuele Cornell 2020
Abdel HEBA 2020
Summary¶
Classes:
IterativeCSVWriter : Write CSV files a line at a time.
Functions:
append_eos_token : Create labels with <eos> token appended.
convert_index_to_lab : Convert a batch of integer IDs to string labels.
get_md5 : Get the md5 checksum of an input file.
length_to_mask : Creates a binary mask for each sequence.
load_data_csv : Loads CSV and formats string values.
load_data_json : Loads JSON and recursively formats string values.
load_pickle : Utility function for loading .pkl pickle files.
load_pkl : Loads a pkl file.
merge_char : Merge character sequences into word sequences.
merge_csvs : Merge several CSV files into one file.
prepend_bos_token : Create labels with <bos> token at the beginning.
read_audio : General audio loading, based on a custom notation.
read_audio_multichannel : General audio loading, based on a custom notation.
read_kaldi_lab : Read labels in kaldi format.
relative_time_to_absolute : Converts SpeechBrain style relative length to the absolute duration.
save_md5 : Saves the md5 of a list of input files as a pickled dict into a file.
save_pkl : Save an object in pkl format.
split_word : Split word sequences into character sequences.
to_doubleTensor : Convert input to a torch double tensor.
to_floatTensor : Convert input to a torch float tensor.
to_longTensor : Convert input to a torch long tensor.
write_audio : Write audio on disk.
write_stdout : Write data to standard output.
write_txt_file : Write data in text format.
Reference¶
- speechbrain.dataio.dataio.load_data_json(json_path, replacements={})[source]¶
Loads JSON and recursively formats string values.
- Parameters
json_path (str) – Path to the JSON file.
replacements (dict) – Optional dict, e.g. {"data_folder": "/home/speechbrain/data"}. Each key is replaced in string values where it appears wrapped in curly braces.
- Returns
JSON data with replacements applied.
- Return type
dict
Example
>>> json_spec = '''{
...   "ex1": {"files": ["{ROOT}/mic1/ex1.wav", "{ROOT}/mic2/ex1.wav"], "id": 1},
...   "ex2": {"files": [{"spk1": "{ROOT}/ex2.wav"}, {"spk2": "{ROOT}/ex2.wav"}], "id": 2}
... }
... '''
>>> tmpfile = getfixture('tmpdir') / "test.json"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(json_spec)
>>> data = load_data_json(tmpfile, {"ROOT": "/home"})
>>> data["ex1"]["files"][0]
'/home/mic1/ex1.wav'
>>> data["ex2"]["files"][1]["spk2"]
'/home/ex2.wav'
- speechbrain.dataio.dataio.load_data_csv(csv_path, replacements={})[source]¶
Loads CSV and formats string values.
Uses the SpeechBrain legacy CSV data format, where the CSV must have an 'ID' field. If there is a field called 'duration', it is interpreted as a float. The other fields are left as they are (the legacy _format and _opts fields are not used to load the data in any special way).
Bash-like string replacements with $to_replace are supported.
- Parameters
csv_path (str) – Path to the CSV file.
replacements (dict) – Optional dict, e.g. {"data_folder": "/home/speechbrain/data"}. Each key is replaced in string values where it appears prefixed with $.
- Returns
CSV data with replacements applied.
- Return type
dict
Example
>>> csv_spec = '''ID,duration,wav_path
... utt1,1.45,$data_folder/utt1.wav
... utt2,2.0,$data_folder/utt2.wav
... '''
>>> tmpfile = getfixture("tmpdir") / "test.csv"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(csv_spec)
>>> data = load_data_csv(tmpfile, {"data_folder": "/home"})
>>> data["utt1"]["wav_path"]
'/home/utt1.wav'
- speechbrain.dataio.dataio.read_audio(waveforms_obj)[source]¶
General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets specified by JSON.
The custom notation:
The annotation can simply be a path to a file: "/path/to/wav1.wav"
Or it can specify more options in a dict: {"file": "/path/to/wav2.wav", "start": 8000, "stop": 16000}
- Parameters
waveforms_obj (str, dict) – Audio reading annotation, see above for format.
- Returns
Audio tensor with shape: (samples, ).
- Return type
torch.Tensor
Example
>>> dummywav = torch.rand(16000)
>>> import os
>>> tmpfile = os.path.join(str(getfixture('tmpdir')), "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = {"wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0), atol=1e-4)  # replace with eq with sox_io backend
True
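The dict form of the annotation can be pictured with a small plain-Python sketch. This is not SpeechBrain's actual implementation; `select_segment` is a hypothetical helper, and the sketch only assumes what the docstring states, namely that "start" and "stop" are sample indices selecting a segment:

```python
def select_segment(waveforms_obj, signal):
    """Return the part of `signal` described by the annotation (a sketch)."""
    if isinstance(waveforms_obj, str):
        return signal  # a bare path means: use the whole file
    start = waveforms_obj.get("start", 0)
    stop = waveforms_obj.get("stop", len(signal))
    return signal[start:stop]

signal = list(range(16000))  # stand-in for a loaded waveform
annotation = {"file": "/path/to/wav2.wav", "start": 8000, "stop": 16000}
segment = select_segment(annotation, signal)
print(len(segment))  # 8000 samples
```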
- speechbrain.dataio.dataio.read_audio_multichannel(waveforms_obj)[source]¶
General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets specified by JSON.
The custom notation:
The annotation can simply be a path to a file: "/path/to/wav1.wav"
Multiple (possibly multi-channel) files can be specified, as long as they have the same length: {"files": ["/path/to/wav1.wav", "/path/to/wav2.wav"]}
Or a single file can be specified more succinctly: {"files": "/path/to/wav2.wav"}
Start and stop sample indices can also be given to read only a segment within the files: {"files": ["/path/to/wav1.wav", "/path/to/wav2.wav"], "start": 8000, "stop": 16000}
- Parameters
waveforms_obj (str, dict) – Audio reading annotation, see above for format.
- Returns
Audio tensor with shape: (samples, channels).
- Return type
torch.Tensor
Example
>>> dummywav = torch.rand(16000, 2)
>>> import os
>>> tmpfile = os.path.join(str(getfixture('tmpdir')), "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = {"wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0), atol=1e-4)  # replace with eq with sox_io backend
True
- speechbrain.dataio.dataio.write_audio(filepath, audio, samplerate)[source]¶
Write audio to disk. This is essentially a wrapper that supports saving audio signals in the SpeechBrain format (audio, channels).
- Parameters
filepath (path) – Path where to save the audio file.
audio (torch.Tensor) – Audio file in the expected speechbrain format (signal, channels).
samplerate (int) – Sample rate (e.g., 16000).
Example
>>> import os
>>> tmpfile = os.path.join(str(getfixture('tmpdir')), "wave.wav")
>>> dummywav = torch.rand(16000, 2)
>>> write_audio(tmpfile, dummywav, 16000)
>>> loaded = read_audio(tmpfile)
>>> loaded.allclose(dummywav, atol=1e-4)  # replace with eq with sox_io backend
True
- speechbrain.dataio.dataio.load_pickle(pickle_path)[source]¶
Utility function for loading .pkl pickle files.
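A minimal round-trip with the standard library, assuming load_pickle is a thin wrapper around pickle.load (a sketch, not SpeechBrain's exact code):

```python
import os
import pickle
import tempfile

obj = {"utt1": 1.45, "utt2": 2.0}
path = os.path.join(tempfile.mkdtemp(), "meta.pkl")

# Save, then load the object back, as load_pickle is assumed to do.
with open(path, "wb") as f:
    pickle.dump(obj, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)
print(loaded == obj)  # True
```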
- speechbrain.dataio.dataio.to_floatTensor(x: (list, tuple, numpy.ndarray))[source]¶
Converts x (a list, tuple, or numpy.ndarray) to a torch float tensor.
- speechbrain.dataio.dataio.to_doubleTensor(x: (list, tuple, numpy.ndarray))[source]¶
Converts x (a list, tuple, or numpy.ndarray) to a torch double tensor.
- speechbrain.dataio.dataio.to_longTensor(x: (list, tuple, numpy.ndarray))[source]¶
Converts x (a list, tuple, or numpy.ndarray) to a torch long tensor.
- speechbrain.dataio.dataio.convert_index_to_lab(batch, ind2lab)[source]¶
Convert a batch of integer IDs to string labels.
- Parameters
batch (list) – List of lists, each containing a sequence of integer indices.
ind2lab (dict) – Mapping from integer indices to string labels.
- Returns
List of lists, same size as batch, with labels from ind2lab.
- Return type
list
Example
>>> ind2lab = {1: "h", 2: "e", 3: "l", 4: "o"}
>>> out = convert_index_to_lab([[4,1], [1,2,3,3,4]], ind2lab)
>>> for seq in out:
...     print("".join(seq))
oh
hello
- speechbrain.dataio.dataio.relative_time_to_absolute(batch, relative_lens, rate)[source]¶
Converts SpeechBrain style relative length to the absolute duration.
Operates on batch level.
- Parameters
batch (torch.tensor) – Sequences to determine the duration for.
relative_lens (torch.tensor) – The relative length of each sequence in batch. The longest sequence in the batch needs to have relative length 1.0.
rate (float) – The rate at which sequence elements occur in real-world time. Sample rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is features. This has to have 1/s as the unit.
- Returns
Duration of each sequence in seconds.
- Return type
torch.Tensor
Example
>>> batch = torch.ones(2, 16000)
>>> relative_lens = torch.tensor([3./4., 1.0])
>>> rate = 16000
>>> print(relative_time_to_absolute(batch, relative_lens, rate))
tensor([0.7500, 1.0000])
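The conversion the docstring describes reduces to one formula, duration_seconds = relative_len * max_len / rate, sketched here in plain Python (the real function operates on torch tensors):

```python
def relative_to_absolute(max_len, relative_lens, rate):
    """Sketch: relative length (fraction of the longest sequence) to seconds."""
    return [r * max_len / rate for r in relative_lens]

# 16000 samples at 16 kHz is 1 second; relative length 0.75 is 0.75 s.
print(relative_to_absolute(16000, [3.0 / 4.0, 1.0], 16000))  # [0.75, 1.0]
```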
- class speechbrain.dataio.dataio.IterativeCSVWriter(outstream, data_fields, defaults={})[source]¶
Bases:
object
Write CSV files a line at a time.
- Parameters
outstream (file-object) – A writeable stream
data_fields (list) – List of the optional keys to write. Each key will be expanded to the SpeechBrain format, producing three fields: key, key_format, key_opts.
Example
>>> import io
>>> f = io.StringIO()
>>> writer = IterativeCSVWriter(f, ["phn"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
>>> writer.write("UTT1", 2.5, "sil hh ee ll ll oo sil", "string", "")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
>>> writer.write(ID="UTT2", phn="sil ww oo rr ll dd sil", phn_format="string")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
>>> writer.set_default('phn_format', 'string')
>>> writer.write_batch(ID=["UTT3", "UTT4"], phn=["ff oo oo", "bb aa rr"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
UTT3,,ff oo oo,string,
UTT4,,bb aa rr,string,
- set_default(field, value)[source]¶
Sets a default value for the given CSV field.
- Parameters
field (str) – A field in the CSV.
value – The default value.
- write(*args, **kwargs)[source]¶
Writes one data line into the CSV.
- Parameters
*args – Supply a value for every field, in positional form, OR
**kwargs – Supply certain fields by key. The ID field is mandatory for all lines; other fields can be left empty.
- write_batch(*args, **kwargs)[source]¶
Writes a batch of lines into the CSV.
Here each argument should be a list with the same length.
- Parameters
*args – Supply a value for every field, in positional form, OR
**kwargs – Supply certain fields by key. The ID field is mandatory for all lines; other fields can be left empty.
- speechbrain.dataio.dataio.write_txt_file(data, filename, sampling_rate=None)[source]¶
Write data in text format.
- Parameters
data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.
filename (str) – Path to file where to write the data.
sampling_rate (None) – Not used, just here for interface compatibility.
- Return type
None
Example
>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([1, 2, 3, 4])
>>> write_txt_file(signal, os.path.join(tmpdir, 'example.txt'))
- speechbrain.dataio.dataio.write_stdout(data, filename=None, sampling_rate=None)[source]¶
Write data to standard output.
- Parameters
data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.
filename (None) – Not used, just here for compatibility.
sampling_rate (None) – Not used, just here for compatibility.
- Return type
None
Example
>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([[1, 2, 3, 4]])
>>> write_stdout(signal, tmpdir + '/example.txt')
[1, 2, 3, 4]
- speechbrain.dataio.dataio.length_to_mask(length, max_len=None, dtype=None, device=None)[source]¶
Creates a binary mask for each sequence.
Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3
- Parameters
length (torch.LongTensor) – Containing the length of each sequence in the batch. Must be 1D.
max_len (int) – Max length for the mask, also the size of the second dimension.
dtype (torch.dtype, default: None) – The dtype of the generated mask.
device (torch.device, default: None) – The device to put the mask variable.
- Returns
mask – The binary mask.
- Return type
torch.Tensor
Example
>>> length = torch.Tensor([1, 2, 3])
>>> mask = length_to_mask(length)
>>> mask
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
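The masking rule itself is simple: position j of row i is 1 while j < length[i], else 0. A plain-Python sketch (the real function works on torch tensors and supports dtype/device arguments):

```python
def length_to_mask_sketch(lengths, max_len=None):
    """Build a binary mask from per-sequence lengths (a sketch)."""
    max_len = max_len if max_len is not None else max(lengths)
    return [[1.0 if j < l else 0.0 for j in range(max_len)] for l in lengths]

print(length_to_mask_sketch([1, 2, 3]))
# [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
```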
- speechbrain.dataio.dataio.read_kaldi_lab(kaldi_ali, kaldi_lab_opts)[source]¶
Read labels in kaldi format.
Uses kaldi IO.
- Parameters
kaldi_ali (str) – Path to the folder containing the kaldi alignments.
kaldi_lab_opts (str) – Options for reading the kaldi alignments (e.g., 'ali-to-pdf').
- Returns
lab – A dictionary containing the labels.
- Return type
dict
Note
This depends on kaldi-io-for-python. Install it separately. See: https://github.com/vesis84/kaldi-io-for-python
Example
This example requires kaldi files.
lab_folder = '/home/kaldi/egs/TIMIT/s5/exp/dnn4_pretrain-dbn_dnn_ali'
read_kaldi_lab(lab_folder, 'ali-to-pdf')
- speechbrain.dataio.dataio.get_md5(file)[source]¶
Get the md5 checksum of an input file.
- Parameters
file (str) – Path to the file for which to compute the checksum.
- Returns
Checksum for the given filepath.
- Return type
str
Example
>>> get_md5('samples/audio_samples/example1.wav')
'c482d0081ca35302d30d12f1136c34e5'
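The checksum can be computed with the standard library alone. This chunked-read sketch is an assumption about how get_md5 works, not SpeechBrain's exact implementation:

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=65536):
    """Chunked md5, so large audio files never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Quick check against hashing the bytes directly.
path = os.path.join(tempfile.mkdtemp(), "example.bin")
with open(path, "wb") as f:
    _ = f.write(b"hello world")
print(md5_of_file(path) == hashlib.md5(b"hello world").hexdigest())  # True
```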
- speechbrain.dataio.dataio.save_md5(files, out_file)[source]¶
Saves the md5 of a list of input files as a pickled dict into a file.
- Parameters
files (list) – List of input files for which to compute md5 checksums.
out_file (str) – Path where the pickled dict of checksums is saved.
- Returns
None
Example
>>> files = ['samples/audio_samples/example1.wav']
>>> tmpdir = getfixture('tmpdir')
>>> save_md5(files, os.path.join(tmpdir, "md5.pkl"))
- speechbrain.dataio.dataio.save_pkl(obj, file)[source]¶
Save an object in pkl format.
- Parameters
obj (object) – The object to save in pkl format.
file (str) – Path to the output file.
Example
>>> tmpfile = os.path.join(getfixture('tmpdir'), "example.pkl")
>>> save_pkl([1, 2, 3, 4, 5], tmpfile)
>>> load_pkl(tmpfile)
[1, 2, 3, 4, 5]
- speechbrain.dataio.dataio.load_pkl(file)[source]¶
Loads a pkl file.
For an example, see save_pkl.
- Parameters
file (str) – Path to the input pkl file.
- Returns
The loaded object.
- Return type
object
- speechbrain.dataio.dataio.prepend_bos_token(label, bos_index)[source]¶
Create labels with <bos> token at the beginning.
- Parameters
label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length].
bos_index (int) – The index for <bos> token.
- Returns
new_label – The new label with <bos> at the beginning.
- Return type
torch.Tensor
Example
>>> label = torch.LongTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> new_label = prepend_bos_token(label, bos_index=7)
>>> new_label
tensor([[7, 1, 0, 0],
        [7, 2, 3, 0],
        [7, 4, 5, 6]])
- speechbrain.dataio.dataio.append_eos_token(label, length, eos_index)[source]¶
Create labels with <eos> token appended.
- Parameters
label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length]
length (torch.LongTensor) – Containing the original length of each label sequences. Must be 1D.
eos_index (int) – The index for <eos> token.
- Returns
new_label – The new label with <eos> appended.
- Return type
torch.Tensor
Example
>>> label = torch.IntTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> length = torch.LongTensor([1,2,3])
>>> new_label = append_eos_token(label, length, eos_index=7)
>>> new_label
tensor([[1, 7, 0, 0],
        [2, 3, 7, 0],
        [4, 5, 6, 7]], dtype=torch.int32)
- speechbrain.dataio.dataio.merge_char(sequences, space='_')[source]¶
Merge character sequences into word sequences.
- Parameters
sequences (list) – Each item is a list containing a character sequence.
space (str) – The token that represents a space. Default: _
- Returns
A list containing the word sequences for each sentence.
- Return type
list
Example
>>> sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
>>> results = merge_char(sequences)
>>> results
[['ab', 'c', 'de'], ['efg', 'hi']]
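One way this merging can be implemented, assuming (as the docstring suggests) that the space token simply delimits words inside the joined character string; a sketch, not necessarily SpeechBrain's exact code:

```python
def merge_char_sketch(sequences, space="_"):
    """Join each character sequence, then split on the space token."""
    return ["".join(seq).split(space) for seq in sequences]

print(merge_char_sketch([["a", "b", "_", "c", "_", "d", "e"],
                         ["e", "f", "g", "_", "h", "i"]]))
# [['ab', 'c', 'de'], ['efg', 'hi']]
```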
- speechbrain.dataio.dataio.merge_csvs(data_folder, csv_lst, merged_csv)[source]¶
Merge several CSV files into one file.
- Parameters
data_folder (str) – The folder that stores the CSV files to be merged, and where the merged file is written.
csv_lst (list) – Filenames of the CSV files to be merged.
merged_csv (str) – Filename of the merged CSV file to write.
Example
>>> merge_csvs("samples/audio_samples/",
...     ["csv_example.csv", "csv_example2.csv"],
...     "test_csv_merge.csv")
- speechbrain.dataio.dataio.split_word(sequences, space='_')[source]¶
Split word sequences into character sequences.
- Parameters
sequences (list) – Each item is a list containing a word sequence.
space (str) – The token that represents a space. Default: _
- Returns
A list containing the character sequences for each sentence.
- Return type
list
Example
>>> sequences = [['ab', 'c', 'de'], ['efg', 'hi']]
>>> results = split_word(sequences)
>>> results
[['a', 'b', '_', 'c', '_', 'd', 'e'], ['e', 'f', 'g', '_', 'h', 'i']]