speechbrain.utils.edit_distance module

Edit distance and WER computation.

Authors
  • Aku Rouhe 2020

Summary

Functions:

accumulatable_wer_stats

Computes word error rate and the related counts for a batch.

alignment

Get the edit distance alignment from an edit op table.

count_ops

Count the edit operations in the shortest edit path in edit op table.

op_table

Table of edit operations between a and b.

top_wer_spks

Finds the K speakers with the highest word error rates.

top_wer_utts

Finds the k utterances with the highest word error rates.

wer_details_by_speaker

Compute word error rate and other salient info, grouped by speaker.

wer_details_by_utterance

Computes a wealth of WER info about each utterance.

wer_details_for_batch

Convenient batch interface for wer_details_by_utterance.

wer_summary

Computes summary stats from the output of wer_details_by_utterance.

Reference

speechbrain.utils.edit_distance.accumulatable_wer_stats(refs, hyps, stats={})[source]

Computes word error rate and the related counts for a batch.

Can also be used to accumulate the counts over many batches, by passing the output back to the function in the call for the next batch.

Parameters
  • refs (iterable) – Batch of reference sequences.

  • hyps (iterable) – Batch of hypothesis sequences.

  • stats (collections.Counter) – The running statistics. Pass the output of this function back as this parameter to accumulate the counts. It may be cleanest to initialize the stats yourself; then an empty collections.Counter() should be used.

Returns

The updated running statistics, with keys:

  • ”WER” - word error rate

  • ”insertions” - number of insertions

  • ”deletions” - number of deletions

  • ”substitutions” - number of substitutions

  • ”num_ref_tokens” - number of reference tokens

Return type

collections.Counter

Example

>>> import collections
>>> batches = [[[[1,2,3],[4,5,6]], [[1,2,4],[5,6]]],
...             [[[7,8], [9]],     [[7,8],  [10]]]]
>>> stats = collections.Counter()
>>> for batch in batches:
...     refs, hyps = batch
...     stats = accumulatable_wer_stats(refs, hyps, stats)
>>> print("%WER {WER:.2f}, {num_ref_tokens} ref tokens".format(**stats))
%WER 33.33, 9 ref tokens
speechbrain.utils.edit_distance.op_table(a, b)[source]

Table of edit operations between a and b.

Solves for the table of edit operations, which is mainly used to compute word error rate. The table is of size [|a|+1, |b|+1], and each point (i, j) in the table has an edit operation. The edit operations can be deterministically followed backwards to find the shortest edit path from a[:i] to b[:j]. Indices of zero (i=0 or j=0) correspond to an empty sequence.

The algorithm itself is well known; see: Levenshtein distance.

Note that in some cases there are multiple valid edit operation paths which lead to the same edit distance minimum.

Parameters
  • a (iterable) – Sequence for which the edit operations are solved.

  • b (iterable) – Sequence for which the edit operations are solved.

Returns

List of lists (a matrix): the table of edit operations.

Return type

list

Example

>>> ref = [1,2,3]
>>> hyp = [1,2,4]
>>> for row in op_table(ref, hyp):
...     print(row)
['=', 'I', 'I', 'I']
['D', '=', 'I', 'I']
['D', 'D', '=', 'I']
['D', 'D', 'D', 'S']
speechbrain.utils.edit_distance.alignment(table)[source]

Get the edit distance alignment from an edit op table.

Walks back an edit operations table, produced by op_table(a, b), and collects the edit distance alignment of a to b. The alignment shows which token in a corresponds to which token in b. Note that the alignment is monotonic, one-to-zero-or-one.

Parameters

table (list) – Edit operations table from op_table(a, b).

Returns

List of edit operations and the corresponding indices into a and b, with schema [(str <edit-op>, int-or-None <i>, int-or-None <j>), ...]. See the EDIT_SYMBOLS dict for the edit ops. The index i points into a and j into b; either index can be None, which means aligning to nothing.

Return type

list

Example

>>> # table for a=[1,2,3], b=[1,2,4]:
>>> table = [['I', 'I', 'I', 'I'],
...          ['D', '=', 'I', 'I'],
...          ['D', 'D', '=', 'I'],
...          ['D', 'D', 'D', 'S']]
>>> print(alignment(table))
[('=', 0, 0), ('=', 1, 1), ('S', 2, 2)]
speechbrain.utils.edit_distance.count_ops(table)[source]

Count the edit operations in the shortest edit path in edit op table.

Walks back an edit operations table produced by op_table(a, b) and counts the number of insertions, deletions, and substitutions in the shortest edit path. This information is typically used in speech recognition to report the number of different error types separately.

Parameters

table (list) – Edit operations table from op_table(a, b).

Returns

The counts of the edit operations, with keys:

  • ”insertions”

  • ”deletions”

  • ”substitutions”

NOTE: not all of the keys might appear explicitly in the output, but the collections.Counter will return 0 for any missing key.

Return type

collections.Counter

Example

>>> table = [['I', 'I', 'I', 'I'],
...          ['D', '=', 'I', 'I'],
...          ['D', 'D', '=', 'I'],
...          ['D', 'D', 'D', 'S']]
>>> print(count_ops(table))
Counter({'substitutions': 1})
speechbrain.utils.edit_distance.wer_details_for_batch(ids, refs, hyps, compute_alignments=False)[source]

Convenient batch interface for wer_details_by_utterance.

wer_details_by_utterance can handle missing hypotheses, but sometimes (e.g. CTC training with greedy decoding) that flexibility is not needed, and this batch interface is more convenient in that case.

Parameters
  • ids (list, torch.tensor) – Utterance ids for the batch.

  • refs (list, torch.tensor) – Reference sequences.

  • hyps (list, torch.tensor) – Hypothesis sequences.

  • compute_alignments (bool, optional) – Whether to compute alignments or not. If computed, the details will also store the refs and hyps. (default: False)

Returns

See wer_details_by_utterance

Return type

list

Example

>>> ids = [['utt1'], ['utt2']]
>>> refs = [[['a','b','c']], [['d','e']]]
>>> hyps = [[['a','b','d']], [['d','e']]]
>>> wer_details = []
>>> for ids_batch, refs_batch, hyps_batch in zip(ids, refs, hyps):
...     details = wer_details_for_batch(ids_batch, refs_batch, hyps_batch)
...     wer_details.extend(details)
>>> print(wer_details[0]['key'], ":",
...     "{:.2f}".format(wer_details[0]['WER']))
utt1 : 33.33
speechbrain.utils.edit_distance.wer_details_by_utterance(ref_dict, hyp_dict, compute_alignments=False, scoring_mode='strict')[source]

Computes a wealth of WER info about each utterance.

This info can then be used to compute summary details (WER, SER).

Parameters
  • ref_dict (dict) – Should be indexable by utterance ids, and return the reference tokens for each utterance id as an iterable.

  • hyp_dict (dict) – Should be indexable by utterance ids, and return the hypothesis tokens for each utterance id as an iterable.

  • compute_alignments (bool) – Whether alignments should also be saved. This also saves the tokens themselves, as they are probably required for printing the alignments.

  • scoring_mode ({'strict', 'all', 'present'}) –

    How to deal with missing hypotheses (reference utterance id not found in hyp_dict).

    • ’strict’: Raise error for missing hypotheses.

    • ’all’: Score missing hypotheses as empty.

    • ’present’: Only score existing hypotheses.

Returns

A list with one entry for every reference utterance. Each entry is a dict with keys:

  • ”key”: utterance id

  • ”scored”: (bool) Whether utterance was scored.

  • ”hyp_absent”: (bool) True if a hypothesis was NOT found.

  • ”hyp_empty”: (bool) True if hypothesis was considered empty (either because it was empty, or not found and mode ‘all’).

  • ”num_edits”: (int) Number of edits in total.

  • ”num_ref_tokens”: (int) Number of tokens in the reference.

  • ”WER”: (float) Word error rate of the utterance.

  • ”insertions”: (int) Number of insertions.

  • ”deletions”: (int) Number of deletions.

  • ”substitutions”: (int) Number of substitutions.

  • ”alignment”: If compute_alignments is True, alignment as list, see speechbrain.utils.edit_distance.alignment. If compute_alignments is False, this is None.

  • ”ref_tokens”: (iterable) The reference tokens, only saved if alignments were computed, else None.

  • ”hyp_tokens”: (iterable) The hypothesis tokens, only saved if alignments were computed, else None.

Return type

list

Raises

KeyError – If scoring mode is ‘strict’ and a hypothesis is not found.
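
A minimal usage sketch; the utterance ids and tokens below are illustrative, not taken from the library:

>>> ref_dict = {'utt1': ['a', 'b', 'c'], 'utt2': ['d', 'e']}
>>> hyp_dict = {'utt1': ['a', 'b', 'd'], 'utt2': ['d', 'e']}
>>> details = wer_details_by_utterance(ref_dict, hyp_dict)
>>> # utt1 has one substitution ('c' -> 'd'), utt2 has no errors:
>>> print(details[0]['key'], details[0]['num_edits'], details[1]['num_edits'])
utt1 1 0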

speechbrain.utils.edit_distance.wer_summary(details_by_utterance)[source]

Computes summary stats from the output of wer_details_by_utterance.

The summary stats include the overall WER and SER.

Parameters

details_by_utterance (list) – See the output of wer_details_by_utterance

Returns

Dictionary with keys:

  • ”WER”: (float) Word Error Rate.

  • ”SER”: (float) Sentence Error Rate (percentage of utterances which had at least one error).

  • ”num_edits”: (int) Total number of edits.

  • ”num_scored_tokens”: (int) Total number of tokens in scored reference utterances (a missing hypothesis might still have been scored with ‘all’ scoring mode).

  • ”num_erraneous_sents”: (int) Total number of utterances which had at least one error.

  • ”num_scored_sents”: (int) Total number of utterances which were scored.

  • ”num_absent_sents”: (int) Number of utterances for which no hypothesis was found.

  • ”num_ref_sents”: (int) Number of all reference utterances.

  • ”insertions”: (int) Total number of insertions.

  • ”deletions”: (int) Total number of deletions.

  • ”substitutions”: (int) Total number of substitutions.

NOTE: Some cases lead to ambiguity over number of insertions, deletions and substitutions. We aim to replicate Kaldi compute_wer numbers.

Return type

dict
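
A minimal sketch, reusing the kind of illustrative data shown for wer_details_by_utterance above:

>>> ref_dict = {'utt1': ['a', 'b', 'c'], 'utt2': ['d', 'e']}
>>> hyp_dict = {'utt1': ['a', 'b', 'd'], 'utt2': ['d', 'e']}
>>> details = wer_details_by_utterance(ref_dict, hyp_dict)
>>> summary = wer_summary(details)
>>> # One edit out of five scored reference tokens:
>>> print("%WER {WER:.2f} [ {num_edits} / {num_scored_tokens} ]".format(**summary))
%WER 20.00 [ 1 / 5 ]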

speechbrain.utils.edit_distance.wer_details_by_speaker(details_by_utterance, utt2spk)[source]

Compute word error rate and other salient info, grouped by speaker.

Parameters
  • details_by_utterance (list) – See the output of wer_details_by_utterance

  • utt2spk (dict) – Map from utterance id to speaker id

Returns

Maps speaker id to a dictionary of the statistics, with keys:

  • ”speaker”: Speaker id.

  • ”num_edits”: (int) Number of edits in total by this speaker.

  • ”insertions”: (int) Number of insertions by this speaker.

  • ”dels”: (int) Number of deletions by this speaker.

  • ”subs”: (int) Number of substitutions by this speaker.

  • ”num_scored_tokens”: (int) Number of scored reference tokens by this speaker (a missing hypothesis might still have been scored with ‘all’ scoring mode).

  • ”num_scored_sents”: (int) Number of scored utterances by this speaker.

  • ”num_erraneous_sents”: (int) Number of utterances with at least one error, by this speaker.

  • ”num_absent_sents”: (int) Number of utterances for which no hypothesis was found, by this speaker.

  • ”num_ref_sents”: (int) Number of utterances by this speaker in total.

Return type

dict
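
A minimal sketch with an illustrative utt2spk mapping, continuing from the wer_details_by_utterance example data above:

>>> details = wer_details_by_utterance(
...     {'utt1': ['a', 'b', 'c'], 'utt2': ['d', 'e']},
...     {'utt1': ['a', 'b', 'd'], 'utt2': ['d', 'e']})
>>> utt2spk = {'utt1': 'spk1', 'utt2': 'spk2'}
>>> by_spk = wer_details_by_speaker(details, utt2spk)  # per-speaker edit and sentence counts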

speechbrain.utils.edit_distance.top_wer_utts(details_by_utterance, top_k=20)[source]

Finds the k utterances with the highest word error rates.

Useful for diagnostic purposes, to see where the system is making the most mistakes. Only returns utterances which were not empty, i.e. for which a hypothesis was present and produced some output.

Parameters
  • details_by_utterance (list) – See output of wer_details_by_utterance.

  • top_k (int) – Number of utterances to return.

Returns

List of at most K utterance dicts with the highest word error rates, whose hypotheses were not empty. Each dict has the same keys as in details_by_utterance.

Return type

list
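
A minimal sketch, assuming details holds the output of wer_details_by_utterance as in the examples above; per the return described above, the result is a list of utterance dicts:

>>> worst_utts = top_wer_utts(details, top_k=5)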

speechbrain.utils.edit_distance.top_wer_spks(details_by_speaker, top_k=10)[source]

Finds the K speakers with the highest word error rates.

Useful for diagnostic purposes.

Parameters
  • details_by_speaker (list) – See output of wer_details_by_speaker.

  • top_k (int) – Number of speakers to return.

Returns

List of at most K dicts (with the same keys as details_by_speaker) of speakers sorted by WER.

Return type

list
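
A minimal sketch, assuming by_spk holds the output of wer_details_by_speaker as in the example above:

>>> worst_spks = top_wer_spks(by_spk, top_k=3)  # per the docs above: at most 3 speaker dicts, sorted by WER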