speechbrain.dataio.dataio module
Data reading and writing.
- Authors
Mirco Ravanelli 2020
Aku Rouhe 2020
Ju-Chieh Chou 2020
Samuele Cornell 2020
Abdel HEBA 2020
Gaelle Laperriere 2021
Sahar Ghannay 2021
Sylvain de Langen 2022
Summary
Classes:
IterativeCSVWriter – Write CSV files a line at a time.
Functions:
append_eos_token – Create labels with <eos> token appended.
clean_padding – Sets the value of any padding on the specified tensor to mask_value (out-of-place).
clean_padding_ – Sets the value of any padding on the specified tensor to mask_value (in-place).
convert_index_to_lab – Convert a batch of integer IDs to string labels.
extract_concepts_values – Keep the semantic concepts and values for evaluation.
get_md5 – Get the md5 checksum of an input file.
length_to_mask – Creates a binary mask for each sequence.
load_data_csv – Loads CSV and formats string values.
load_data_json – Loads JSON and recursively formats string values.
load_pickle – Utility function for loading .pkl pickle files.
load_pkl – Loads a pkl file.
merge_char – Merge character sequences into word sequences.
merge_csvs – Merge several csv files into one file.
prepend_bos_token – Create labels with <bos> token at the beginning.
read_audio – General audio loading, based on a custom notation.
read_audio_info – Retrieves audio metadata from a file path.
read_audio_multichannel – General audio loading, based on a custom notation.
read_kaldi_lab – Read labels in kaldi format.
relative_time_to_absolute – Converts SpeechBrain style relative length to the absolute duration.
save_md5 – Saves the md5 of a list of input files as a pickled dict into a file.
save_pkl – Save an object in pkl format.
split_word – Split word sequences into character sequences.
to_doubleTensor – Converts the input into a torch.DoubleTensor.
to_floatTensor – Converts the input into a torch.FloatTensor.
to_longTensor – Converts the input into a torch.LongTensor.
write_audio – Write audio on disk.
write_stdout – Write data to standard output.
write_txt_file – Write data in text format.
Reference
- speechbrain.dataio.dataio.load_data_json(json_path, replacements={})[source]
Loads JSON and recursively formats string values.
- Parameters:
json_path (str) – Path to the json file.
replacements (dict) – Optional string replacements to apply, e.g., {"ROOT": "/home"} as in the example below.
- Returns:
JSON data with replacements applied.
- Return type:
dict
Example
>>> json_spec = '''{
...     "ex1": {"files": ["{ROOT}/mic1/ex1.wav", "{ROOT}/mic2/ex1.wav"], "id": 1},
...     "ex2": {"files": [{"spk1": "{ROOT}/ex2.wav"}, {"spk2": "{ROOT}/ex2.wav"}], "id": 2}
... }
... '''
>>> tmpfile = getfixture('tmpdir') / "test.json"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(json_spec)
>>> data = load_data_json(tmpfile, {"ROOT": "/home"})
>>> data["ex1"]["files"][0]
'/home/mic1/ex1.wav'
>>> data["ex2"]["files"][1]["spk2"]
'/home/ex2.wav'
- speechbrain.dataio.dataio.load_data_csv(csv_path, replacements={})[source]
Loads CSV and formats string values.
Uses the SpeechBrain legacy CSV data format, where the CSV must have an ‘ID’ field. If there is a field called duration, it is interpreted as a float. The rest of the fields are left as they are (legacy _format and _opts fields are not used to load the data in any special way).
Bash-like string replacements with $to_replace are supported.
- Parameters:
csv_path (str) – Path to the csv file.
replacements (dict) – Optional string replacements to apply, e.g., replacing $data_folder as in the example below.
- Returns:
CSV data with replacements applied.
- Return type:
dict
Example
>>> csv_spec = '''ID,duration,wav_path
... utt1,1.45,$data_folder/utt1.wav
... utt2,2.0,$data_folder/utt2.wav
... '''
>>> tmpfile = getfixture("tmpdir") / "test.csv"
>>> with open(tmpfile, "w") as fo:
...     _ = fo.write(csv_spec)
>>> data = load_data_csv(tmpfile, {"data_folder": "/home"})
>>> data["utt1"]["wav_path"]
'/home/utt1.wav'
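The $key replacement behavior can be illustrated with plain Python. This is a sketch of the idea only; the helper name apply_replacements is hypothetical, and the actual SpeechBrain implementation may differ in details:

```python
# Sketch of the bash-like $key substitution applied by load_data_csv.
# apply_replacements is a hypothetical helper, not part of SpeechBrain.

def apply_replacements(value: str, replacements: dict) -> str:
    """Replace each occurrence of $key in value with replacements[key]."""
    for key, target in replacements.items():
        value = value.replace("$" + key, target)
    return value

# One parsed CSV row, before substitution:
row = {"ID": "utt1", "duration": "1.45", "wav_path": "$data_folder/utt1.wav"}
resolved = {k: apply_replacements(v, {"data_folder": "/home"})
            for k, v in row.items()}
print(resolved["wav_path"])  # /home/utt1.wav
```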
- speechbrain.dataio.dataio.read_audio_info(path) AudioMetaData [source]
Retrieves audio metadata from a file path. Behaves identically to torchaudio.info, but attempts to fix metadata (such as frame count) that is otherwise broken with certain torchaudio version and codec combinations.
Note that this may cause full file traversal in certain cases!
- Parameters:
path (str) – Path to the audio file to examine.
- Returns:
Same value as returned by torchaudio.info, but num_frames may have been corrected if it would otherwise have been == 0.
- Return type:
torchaudio.backend.common.AudioMetaData
Note
Some codecs, such as MP3, require full file traversal for accurate length information to be retrieved. In these cases, you may as well read the entire audio file to avoid doubling the processing time.
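For uncompressed WAV files, the frame count can be read cheaply from the container header; Python's standard-library wave module illustrates the idea (a simplified sketch only — read_audio_info itself goes through torchaudio and also handles compressed codecs like MP3, where the header alone is not enough):

```python
import wave

def wav_num_frames(path: str) -> int:
    """Read the frame count stored in a WAV header (no sample decoding)."""
    with wave.open(path, "rb") as f:
        return f.getnframes()

# Write one second of 16 kHz mono silence, then read its metadata back.
with wave.open("dummy.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)        # 16-bit samples
    f.setframerate(16000)
    f.writeframes(b"\x00\x00" * 16000)

print(wav_num_frames("dummy.wav"))  # 16000
```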
- speechbrain.dataio.dataio.read_audio(waveforms_obj)[source]
General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets specified by JSON.
The parameter may just be a path to a file: read_audio("/path/to/wav1.wav")

Alternatively, you can specify more options in a dict, e.g.:

```
# load a file from sample 8000 through 15999
read_audio({
    "file": "/path/to/wav2.wav",
    "start": 8000,
    "stop": 16000,
})
```
Which codecs are supported depends on your torchaudio backend. Refer to torchaudio.load documentation for further details.
- Parameters:
waveforms_obj (str, dict) – Path to audio or dict with the desired configuration.
Keys for the dict variant:
- "file" (str): Path to the audio file.
- "start" (int, optional): The first sample to load. If unspecified, load from the very first frame.
- "stop" (int, optional): The last sample to load (exclusive). If unspecified or equal to start, load from start to the end. Will not fail if stop is past the sample count of the file; it will simply return fewer frames.
- Returns:
1-channel: audio tensor with shape (samples,). >=2-channels: audio tensor with shape (samples, channels).
- Return type:
torch.Tensor
Example
>>> dummywav = torch.rand(16000)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0), atol=1e-4)  # replace with eq with sox_io backend
True
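The start/stop semantics described above follow Python slicing: a stop past the end of the file simply yields fewer samples instead of raising. A plain-Python sketch of that behavior (slice_samples is a hypothetical helper, not the actual implementation, which slices the decoded tensor):

```python
def slice_samples(samples, start=0, stop=None):
    """Mimic read_audio's segment selection: stop is exclusive, and a stop
    past the end of the data returns fewer samples instead of failing.
    If stop is unspecified or equal to start, read from start to the end."""
    if stop is None or stop == start:
        return samples[start:]
    return samples[start:stop]

data = list(range(16000))                       # stand-in for decoded samples
print(len(slice_samples(data, 8000, 16000)))    # 8000 samples: 8000..15999
print(len(slice_samples(data, 8000, 999999)))   # also 8000: stop past the end
```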
- speechbrain.dataio.dataio.read_audio_multichannel(waveforms_obj)[source]
General audio loading, based on a custom notation.
Expected use case is in conjunction with Datasets specified by JSON.
The custom notation:

The annotation can be just a path to a file: "/path/to/wav1.wav"

Multiple (possibly multi-channel) files can be specified, as long as they have the same length:

{"files": [
    "/path/to/wav1.wav",
    "/path/to/wav2.wav"
]}

Or you can specify a single file more succinctly: {"files": "/path/to/wav2.wav"}

Start and stop sample counts can also be specified, to read only a segment within the files:

{"files": [
    "/path/to/wav1.wav",
    "/path/to/wav2.wav"
],
 "start": 8000,
 "stop": 16000}
- Parameters:
waveforms_obj (str, dict) – Audio reading annotation, see above for format.
- Returns:
Audio tensor with shape: (samples, ).
- Return type:
torch.Tensor
Example
>>> dummywav = torch.rand(16000, 2)
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> write_audio(tmpfile, dummywav, 16000)
>>> asr_example = { "wav": tmpfile, "spk_id": "foo", "words": "foo bar"}
>>> loaded = read_audio(asr_example["wav"])
>>> loaded.allclose(dummywav.squeeze(0), atol=1e-4)  # replace with eq with sox_io backend
True
- speechbrain.dataio.dataio.write_audio(filepath, audio, samplerate)[source]
Write audio on disk. It is basically a wrapper to support saving audio signals in the speechbrain format (audio, channels).
- Parameters:
filepath (path) – Path where to save the audio file.
audio (torch.Tensor) – Audio file in the expected speechbrain format (signal, channels).
samplerate (int) – Sample rate (e.g., 16000).
Example
>>> import os
>>> tmpfile = str(getfixture('tmpdir') / "wave.wav")
>>> dummywav = torch.rand(16000, 2)
>>> write_audio(tmpfile, dummywav, 16000)
>>> loaded = read_audio(tmpfile)
>>> loaded.allclose(dummywav, atol=1e-4)  # replace with eq with sox_io backend
True
- speechbrain.dataio.dataio.load_pickle(pickle_path)[source]
Utility function for loading .pkl pickle files.
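The file format these pickle helpers operate on is the standard one; a minimal round trip with Python's pickle module sketches the behavior (the SpeechBrain helpers add path handling and error reporting around this):

```python
import pickle

obj = {"ids": ["utt1", "utt2"], "durations": [1.45, 2.0]}

# Save an object in .pkl format...
with open("example.pkl", "wb") as f:
    pickle.dump(obj, f)

# ...and load it back.
with open("example.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded == obj)  # True
```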
- speechbrain.dataio.dataio.to_floatTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Converts the input into a torch.FloatTensor.
- speechbrain.dataio.dataio.to_doubleTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Converts the input into a torch.DoubleTensor.
- speechbrain.dataio.dataio.to_longTensor(x: (<class 'list'>, <class 'tuple'>, <class 'numpy.ndarray'>))[source]
Converts the input into a torch.LongTensor.
- speechbrain.dataio.dataio.convert_index_to_lab(batch, ind2lab)[source]
Convert a batch of integer IDs to string labels.
- Parameters:
batch (list) – List of lists of integer indices, one list per sequence in the batch.
ind2lab (dict) – Mapping from integer index to string label.
- Returns:
List of lists, same size as batch, with labels from ind2lab.
- Return type:
list
Example
>>> ind2lab = {1: "h", 2: "e", 3: "l", 4: "o"}
>>> out = convert_index_to_lab([[4,1], [1,2,3,3,4]], ind2lab)
>>> for seq in out:
...     print("".join(seq))
oh
hello
- speechbrain.dataio.dataio.relative_time_to_absolute(batch, relative_lens, rate)[source]
Converts SpeechBrain style relative length to the absolute duration.
Operates on batch level.
- Parameters:
batch (torch.tensor) – Sequences to determine the duration for.
relative_lens (torch.tensor) – The relative length of each sequence in batch. The longest sequence in the batch needs to have relative length 1.0.
rate (float) – The rate at which sequence elements occur in real-world time. Sample rate, if batch is raw wavs (recommended) or 1/frame_shift if batch is features. This has to have 1/s as the unit.
- Returns:
Duration of each sequence in seconds.
- Return type:
torch.tensor
Example
>>> batch = torch.ones(2, 16000)
>>> relative_lens = torch.tensor([3./4., 1.0])
>>> rate = 16000
>>> print(relative_time_to_absolute(batch, relative_lens, rate))
tensor([0.7500, 1.0000])
- class speechbrain.dataio.dataio.IterativeCSVWriter(outstream, data_fields, defaults={})[source]
Bases:
object
Write CSV files a line at a time.
- Parameters:
outstream (file-object) – A writeable stream.
data_fields (list) – List of the optional keys to write. Each key will be expanded to the SpeechBrain format, producing three fields: key, key_format, key_opts.
defaults (dict) – Mapping from CSV field to default value (optional).
Example
>>> import io
>>> f = io.StringIO()
>>> writer = IterativeCSVWriter(f, ["phn"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
>>> writer.write("UTT1",2.5,"sil hh ee ll ll oo sil","string","")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
>>> writer.write(ID="UTT2",phn="sil ww oo rr ll dd sil",phn_format="string")
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
>>> writer.set_default('phn_format', 'string')
>>> writer.write_batch(ID=["UTT3","UTT4"],phn=["ff oo oo", "bb aa rr"])
>>> print(f.getvalue())
ID,duration,phn,phn_format,phn_opts
UTT1,2.5,sil hh ee ll ll oo sil,string,
UTT2,,sil ww oo rr ll dd sil,string,
UTT3,,ff oo oo,string,
UTT4,,bb aa rr,string,
- set_default(field, value)[source]
Sets a default value for the given CSV field.
- Parameters:
field (str) – A field in the CSV.
value – The default value.
- write(*args, **kwargs)[source]
Writes one data line into the CSV.
- Parameters:
*args – Supply a value for every field, in positional form (use either this or **kwargs).
**kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.
- write_batch(*args, **kwargs)[source]
Writes a batch of lines into the CSV.
Here each argument should be a list with the same length.
- Parameters:
*args – Supply a value for every field, in positional form (use either this or **kwargs).
**kwargs – Supply certain fields by key. The ID field is mandatory for all lines, but others can be left empty.
- speechbrain.dataio.dataio.write_txt_file(data, filename, sampling_rate=None)[source]
Write data in text format.
- Parameters:
data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.
filename (str) – Path to file where to write the data.
sampling_rate (None) – Not used, just here for interface compatibility.
- Return type:
None
Example
>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([1,2,3,4])
>>> write_txt_file(signal, tmpdir / 'example.txt')
- speechbrain.dataio.dataio.write_stdout(data, filename=None, sampling_rate=None)[source]
Write data to standard output.
- Parameters:
data (str, list, torch.tensor, numpy.ndarray) – The data to write in the text file.
filename (None) – Not used, just here for compatibility.
sampling_rate (None) – Not used, just here for compatibility.
- Return type:
None
Example
>>> tmpdir = getfixture('tmpdir')
>>> signal = torch.tensor([[1,2,3,4]])
>>> write_stdout(signal, tmpdir / 'example.txt')
[1, 2, 3, 4]
- speechbrain.dataio.dataio.length_to_mask(length, max_len=None, dtype=None, device=None)[source]
Creates a binary mask for each sequence.
Reference: https://discuss.pytorch.org/t/how-to-generate-variable-length-mask/23397/3
- Parameters:
length (torch.LongTensor) – Containing the length of each sequence in the batch. Must be 1D.
max_len (int) – Max length for the mask, also the size of the second dimension.
dtype (torch.dtype, default: None) – The dtype of the generated mask.
device (torch.device, default: None) – The device to put the mask variable.
- Returns:
mask – The binary mask.
- Return type:
tensor
Example
>>> length = torch.Tensor([1,2,3])
>>> mask = length_to_mask(length)
>>> mask
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
- speechbrain.dataio.dataio.read_kaldi_lab(kaldi_ali, kaldi_lab_opts)[source]
Read labels in kaldi format.
Uses kaldi IO.
- Parameters:
kaldi_ali (str) – Path to the directory where the kaldi alignments are stored.
kaldi_lab_opts (str) – A string that contains the options for reading the kaldi alignments (e.g., 'ali-to-pdf').
- Returns:
lab – A dictionary containing the labels.
- Return type:
dict
Note
This depends on kaldi-io-for-python. Install it separately. See: https://github.com/vesis84/kaldi-io-for-python
Example
This example requires kaldi files.
lab_folder = '/home/kaldi/egs/TIMIT/s5/exp/dnn4_pretrain-dbn_dnn_ali'
read_kaldi_lab(lab_folder, 'ali-to-pdf')
- speechbrain.dataio.dataio.get_md5(file)[source]
Get the md5 checksum of an input file.
- Parameters:
file (str) – Path to the file for which to compute the checksum.
- Returns:
Checksum for the given filepath.
- Return type:
str
Example
>>> get_md5('tests/samples/single-mic/example1.wav')
'c482d0081ca35302d30d12f1136c34e5'
- speechbrain.dataio.dataio.save_md5(files, out_file)[source]
Saves the md5 of a list of input files as a pickled dict into a file.
- Parameters:
files (list) – List of input file paths.
out_file (str) – Path to the output pkl file.
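What get_md5 and save_md5 amount to can be sketched with only the standard library (md5_of_file and save_checksums are hypothetical names for illustration; the real helpers are get_md5 and save_md5 above):

```python
import hashlib
import pickle

def md5_of_file(path: str) -> str:
    """Compute an md5 checksum, reading the file in chunks to bound memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def save_checksums(files, out_file):
    """Pickle a {path: md5} dict, as save_md5 does."""
    with open(out_file, "wb") as f:
        pickle.dump({path: md5_of_file(path) for path in files}, f)

# Example: checksum a small file created on the spot.
with open("data.txt", "wb") as f:
    _ = f.write(b"hello")
save_checksums(["data.txt"], "checksums.pkl")
with open("checksums.pkl", "rb") as f:
    print(pickle.load(f)["data.txt"])  # 5d41402abc4b2a76b9719d911017c592
```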
- speechbrain.dataio.dataio.save_pkl(obj, file)[source]
Save an object in pkl format.
- Parameters:
obj (object) – The object to save in pkl format.
file (str) – Path to the output file.
Example
>>> tmpfile = getfixture('tmpdir') / "example.pkl"
>>> save_pkl([1, 2, 3, 4, 5], tmpfile)
>>> load_pkl(tmpfile)
[1, 2, 3, 4, 5]
- speechbrain.dataio.dataio.load_pkl(file)[source]
Loads a pkl file.
For an example, see save_pkl.
- Parameters:
file (str) – Path to the input pkl file.
- Return type:
The loaded object.
- speechbrain.dataio.dataio.prepend_bos_token(label, bos_index)[source]
Create labels with <bos> token at the beginning.
- Parameters:
label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length].
bos_index (int) – The index for <bos> token.
- Returns:
new_label – The new label with <bos> at the beginning.
- Return type:
tensor
Example
>>> label = torch.LongTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> new_label = prepend_bos_token(label, bos_index=7)
>>> new_label
tensor([[7, 1, 0, 0],
        [7, 2, 3, 0],
        [7, 4, 5, 6]])
- speechbrain.dataio.dataio.append_eos_token(label, length, eos_index)[source]
Create labels with <eos> token appended.
- Parameters:
label (torch.IntTensor) – Containing the original labels. Must be of size: [batch_size, max_length]
length (torch.LongTensor) – Containing the original length of each label sequence. Must be 1D.
eos_index (int) – The index for <eos> token.
- Returns:
new_label – The new label with <eos> appended.
- Return type:
tensor
Example
>>> label = torch.IntTensor([[1,0,0], [2,3,0], [4,5,6]])
>>> length = torch.LongTensor([1,2,3])
>>> new_label = append_eos_token(label, length, eos_index=7)
>>> new_label
tensor([[1, 7, 0, 0],
        [2, 3, 7, 0],
        [4, 5, 6, 7]], dtype=torch.int32)
- speechbrain.dataio.dataio.merge_char(sequences, space='_')[source]
Merge character sequences into word sequences.
- Parameters:
sequences (list) – Each item contains a list, and this list contains a character sequence.
space (string) – The token that represents a space. Default: _
- Return type:
The list contains word sequences for each sentence.
Example
>>> sequences = [["a", "b", "_", "c", "_", "d", "e"], ["e", "f", "g", "_", "h", "i"]]
>>> results = merge_char(sequences)
>>> results
[['ab', 'c', 'de'], ['efg', 'hi']]
- speechbrain.dataio.dataio.merge_csvs(data_folder, csv_lst, merged_csv)[source]
Merge several csv files into one file.
- Parameters:
data_folder (string) – The folder where the csv files to be merged are stored; the merged csv is written there as well.
csv_lst (list) – Filenames of csv file to be merged.
merged_csv (string) – The filename to write the merged csv file.
Example
>>> tmpdir = getfixture('tmpdir')
>>> os.symlink(os.path.realpath("tests/samples/annotation/speech.csv"), tmpdir / "speech.csv")
>>> merge_csvs(tmpdir,
...     ["speech.csv", "speech.csv"],
...     "test_csv_merge.csv")
- speechbrain.dataio.dataio.split_word(sequences, space='_')[source]
Split word sequences into character sequences.
- Parameters:
sequences (list) – Each item contains a list, and this list contains a word sequence.
space (string) – The token that represents a space. Default: _
- Return type:
The list contains the character sequences for each sentence.
Example
>>> sequences = [['ab', 'c', 'de'], ['efg', 'hi']]
>>> results = split_word(sequences)
>>> results
[['a', 'b', '_', 'c', '_', 'd', 'e'], ['e', 'f', 'g', '_', 'h', 'i']]
- speechbrain.dataio.dataio.clean_padding_(tensor, length, len_dim=1, mask_value=0.0)[source]
Sets the value of any padding on the specified tensor to mask_value.
For instance, this can be used to zero out the outputs of an autoencoder during training past the specified length.
This is an in-place operation.
- Parameters:
tensor (torch.Tensor) – a tensor of arbitrary dimension
length (torch.Tensor) – a 1-D tensor of lengths
len_dim (int) – the dimension representing the length
mask_value (mixed) – the value to be assigned to padding positions
Example
>>> import torch
>>> x = torch.arange(5).unsqueeze(0).repeat(3, 1)
>>> x = x + torch.arange(3).unsqueeze(-1)
>>> x
tensor([[0, 1, 2, 3, 4],
        [1, 2, 3, 4, 5],
        [2, 3, 4, 5, 6]])
>>> length = torch.tensor([0.4, 1.0, 0.6])
>>> clean_padding_(x, length=length, mask_value=10.)
>>> x
tensor([[ 0,  1, 10, 10, 10],
        [ 1,  2,  3,  4,  5],
        [ 2,  3,  4, 10, 10]])
>>> x = torch.arange(5)[None, :, None].repeat(3, 1, 2)
>>> x = x + torch.arange(3)[:, None, None]
>>> x = x * torch.arange(1, 3)[None, None, :]
>>> x = x.transpose(1, 2)
>>> x
tensor([[[ 0,  1,  2,  3,  4],
         [ 0,  2,  4,  6,  8]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4,  5,  6],
         [ 4,  6,  8, 10, 12]]])
>>> clean_padding_(x, length=length, mask_value=10., len_dim=2)
>>> x
tensor([[[ 0,  1, 10, 10, 10],
         [ 0,  2, 10, 10, 10]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4, 10, 10],
         [ 4,  6,  8, 10, 10]]])
- speechbrain.dataio.dataio.clean_padding(tensor, length, len_dim=1, mask_value=0.0)[source]
Sets the value of any padding on the specified tensor to mask_value.
For instance, this can be used to zero out the outputs of an autoencoder during training past the specified length.
This version of the operation does not modify the original tensor.
- Parameters:
tensor (torch.Tensor) – a tensor of arbitrary dimension
length (torch.Tensor) – a 1-D tensor of lengths
len_dim (int) – the dimension representing the length
mask_value (mixed) – the value to be assigned to padding positions
Example
>>> import torch
>>> x = torch.arange(5).unsqueeze(0).repeat(3, 1)
>>> x = x + torch.arange(3).unsqueeze(-1)
>>> x
tensor([[0, 1, 2, 3, 4],
        [1, 2, 3, 4, 5],
        [2, 3, 4, 5, 6]])
>>> length = torch.tensor([0.4, 1.0, 0.6])
>>> x_p = clean_padding(x, length=length, mask_value=10.)
>>> x_p
tensor([[ 0,  1, 10, 10, 10],
        [ 1,  2,  3,  4,  5],
        [ 2,  3,  4, 10, 10]])
>>> x = torch.arange(5)[None, :, None].repeat(3, 1, 2)
>>> x = x + torch.arange(3)[:, None, None]
>>> x = x * torch.arange(1, 3)[None, None, :]
>>> x = x.transpose(1, 2)
>>> x
tensor([[[ 0,  1,  2,  3,  4],
         [ 0,  2,  4,  6,  8]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4,  5,  6],
         [ 4,  6,  8, 10, 12]]])
>>> x_p = clean_padding(x, length=length, mask_value=10., len_dim=2)
>>> x_p
tensor([[[ 0,  1, 10, 10, 10],
         [ 0,  2, 10, 10, 10]],

        [[ 1,  2,  3,  4,  5],
         [ 2,  4,  6,  8, 10]],

        [[ 2,  3,  4, 10, 10],
         [ 4,  6,  8, 10, 10]]])
- speechbrain.dataio.dataio.extract_concepts_values(sequences, keep_values, tag_in, tag_out, space)[source]
Keep the semantic concepts and values for evaluation.
- Parameters:
sequences (list) – Each item contains a list, and this list contains a character sequence.
keep_values (bool) – If True, keep the values; if False, keep only the concepts.
tag_in (char) – Indicates the start of the concept.
tag_out (char) – Indicates the end of the concept.
space (string) – The token that represents a space. Default: _
- Return type:
The list contains concept and value sequences for each sentence.
Example
>>> sequences = [['<reponse>','_','n','o','_','>','_','<localisation-ville>','_','L','e','_','M','a','n','s','_','>'], ['<reponse>','_','s','i','_','>'],['v','a','_','b','e','n','e']] >>> results = extract_concepts_values(sequences, True, '<', '>', '_') >>> results [['<reponse> no', '<localisation-ville> Le Mans'], ['<reponse> si'], ['']]