speechbrain.alignment.aligner module
Alignment code
- Authors
Elena Rastorgueva 2020
Loren Lugosch 2020
Summary
Classes:
HMMAligner – This class calculates Viterbi alignments in the forward method.
Functions:
batch_log_matvecmul – For each 'matrix' and 'vector' pair in the batch, do matrix-vector multiplication in the log domain, i.e., logsumexp instead of add, add instead of multiply.
batch_log_maxvecmul – Similar to batch_log_matvecmul, but takes a maximum instead of logsumexp.
map_inds_to_intersect – Converts 2 lists containing indices for phonemes from different phoneme sets to a single phoneme set, so that comparing the equality of the indices of the resulting lists will yield the correct accuracy.
Reference
- class speechbrain.alignment.aligner.HMMAligner(states_per_phoneme=1, output_folder='', neg_inf=-100000.0, batch_reduction='none', input_len_norm=False, target_len_norm=False, lexicon_path=None)[source]
Bases:
Module
This class calculates Viterbi alignments in the forward method.
It also records alignments and creates batches of them for use in Viterbi training.
- Parameters:
states_per_phoneme (int) – Number of hidden states to use per phoneme.
output_folder (str) – The folder in which alignments will be stored when saved to disk. Not yet implemented.
neg_inf (float) – The float used to represent a negative infinite log probability. Using -float("Inf") tends to give numerical instability. A number more negative than -1e5 also sometimes gave errors when the genbmm library was used (currently not in use). (default: -1e5)
batch_reduction (string) – One of "none", "sum" or "mean". What kind of batch-level reduction to apply to the loss calculated in the forward method.
input_len_norm (bool) – Whether to normalize the loss in the forward method by the length of the inputs.
target_len_norm (bool) – Whether to normalize the loss in the forward method by the length of the targets.
lexicon_path (string) – The location of the lexicon.
Example
>>> log_posteriors = torch.tensor([[[ -1., -10., -10.],
...                                 [-10.,  -1., -10.],
...                                 [-10., -10.,  -1.]],
...
...                                [[ -1., -10., -10.],
...                                 [-10.,  -1., -10.],
...                                 [-10., -10., -10.]]])
>>> lens = torch.tensor([1., 0.66])
>>> phns = torch.tensor([[0, 1, 2],
...                      [0, 1, 0]])
>>> phn_lens = torch.tensor([1., 0.66])
>>> aligner = HMMAligner()
>>> forward_scores = aligner(
...     log_posteriors, lens, phns, phn_lens, 'forward'
... )
>>> forward_scores.shape
torch.Size([2])
>>> viterbi_scores, alignments = aligner(
...     log_posteriors, lens, phns, phn_lens, 'viterbi'
... )
>>> alignments
[[0, 1, 2], [0, 1]]
>>> viterbi_scores.shape
torch.Size([2])
- use_lexicon(words, interword_sils=True, sample_pron=False)[source]
Uses the lexicon to return a sequence of the possible phonemes, the transition/pi probabilities, and the possible final states. Operates on an utterance-by-utterance basis; each utterance in the batch is processed by the helper method _use_lexicon.
- Parameters:
words (list) – List of the words in the transcript
interword_sils (bool) – If True, optional silences will be inserted between every word. If False, optional silences will only be placed at the beginning and end of each utterance.
sample_pron (bool) – If True, it will sample a single possible sequence of phonemes. If False, it will return statistics for all possible sequences of phonemes.
- Returns:
poss_phns (torch.Tensor (batch, phoneme in possible phn sequence)) – The phonemes that are thought to be in each utterance.
poss_phn_lens (torch.Tensor (batch)) – The relative length of each possible phoneme sequence in the batch.
trans_prob (torch.Tensor (batch, from, to)) – Tensor containing transition (log) probabilities.
pi_prob (torch.Tensor (batch, state)) – Tensor containing initial (log) probabilities.
final_state (list of lists of ints) – A list of lists of possible final states for each utterance.
Example
>>> aligner = HMMAligner()
>>> aligner.lexicon = {
...     "a": {0: "a"},
...     "b": {0: "b", 1: "c"}
... }
>>> words = [["a", "b"]]
>>> aligner.lex_lab2ind = {
...     "sil": 0,
...     "a": 1,
...     "b": 2,
...     "c": 3,
... }
>>> poss_phns, poss_phn_lens, trans_prob, pi_prob, final_states = aligner.use_lexicon(
...     words,
...     interword_sils = True
... )
>>> poss_phns
tensor([[0, 1, 0, 2, 3, 0]])
>>> poss_phn_lens
tensor([1.])
>>> trans_prob
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.3863e+00, -1.3863e+00, -1.3863e+00, -1.3863e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -1.0000e+05, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
>>> pi_prob
tensor([[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05]])
>>> final_states
[[3, 4, 5]]
>>> # With no optional silences between words
>>> poss_phns_, _, trans_prob_, pi_prob_, final_states_ = aligner.use_lexicon(
...     words,
...     interword_sils = False
... )
>>> poss_phns_
tensor([[0, 1, 2, 3, 0]])
>>> trans_prob_
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -6.9315e-01, -1.0000e+05, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
>>> pi_prob_
tensor([[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05]])
>>> final_states_
[[2, 3, 4]]
>>> # With sampling of a single possible pronunciation
>>> import random
>>> random.seed(0)
>>> poss_phns_, _, trans_prob_, pi_prob_, final_states_ = aligner.use_lexicon(
...     words,
...     sample_pron = True
... )
>>> poss_phns_
tensor([[0, 1, 0, 2, 0]])
>>> trans_prob_
tensor([[[-6.9315e-01, -6.9315e-01, -1.0000e+05, -1.0000e+05, -1.0000e+05],
         [-1.0000e+05, -1.0986e+00, -1.0986e+00, -1.0986e+00, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01, -1.0000e+05],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -6.9315e-01, -6.9315e-01],
         [-1.0000e+05, -1.0000e+05, -1.0000e+05, -1.0000e+05,  0.0000e+00]]])
- forward(emission_pred, lens, phns, phn_lens, dp_algorithm, prob_matrices=None)[source]
Prepares relevant (log) probability tensors and does dynamic programming: either the forward or the Viterbi algorithm. Applies reduction as specified during object initialization.
- Parameters:
emission_pred (torch.Tensor (batch, time, phoneme in vocabulary)) – Posterior probabilities from our acoustic model.
lens (torch.Tensor (batch)) – The relative duration of each utterance sound file.
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known/thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.
dp_algorithm (string) – Either “forward” or “viterbi”.
prob_matrices (dict) – (Optional) Must contain keys 'trans_prob', 'pi_prob' and 'final_states'. Used to override the default forward and Viterbi operations, which force traversal over all of the states in the phns sequence.
- Returns:
If dp_algorithm == "forward":
forward_scores : torch.Tensor (batch, or scalar) – The (log) likelihood of each utterance in the batch, with reduction applied if specified.
If dp_algorithm == "viterbi":
viterbi_scores : torch.Tensor (batch, or scalar) – The (log) likelihood of the Viterbi path for each utterance, with reduction applied if specified.
alignments : list of lists of int – Viterbi alignments for the files in the batch.
- Return type:
torch.Tensor for "forward", or a tuple of (torch.Tensor, list) for "viterbi"
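For illustration, here is a hedged sketch of wiring use_lexicon's outputs into the prob_matrices override. Only the dict keys and the call signatures are documented above; the toy lexicon is borrowed from the use_lexicon example and the posteriors are invented placeholders.
import torch
from speechbrain.alignment.aligner import HMMAligner

aligner = HMMAligner()
aligner.lexicon = {"a": {0: "a"}, "b": {0: "b", 1: "c"}}   # toy lexicon from the example above
aligner.lex_lab2ind = {"sil": 0, "a": 1, "b": 2, "c": 3}
words = [["a", "b"]]
poss_phns, poss_phn_lens, trans_prob, pi_prob, final_states = aligner.use_lexicon(words)
prob_matrices = {
    "trans_prob": trans_prob,
    "pi_prob": pi_prob,
    "final_states": final_states,
}
# Placeholder posteriors: (batch=1, time=4, phoneme vocabulary=4), invented for the sketch.
log_posteriors = torch.log_softmax(torch.randn(1, 4, 4), dim=-1)
lens = torch.tensor([1.0])
viterbi_scores, alignments = aligner(
    log_posteriors, lens, poss_phns, poss_phn_lens, 'viterbi',
    prob_matrices=prob_matrices,
)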
- expand_phns_by_states_per_phoneme(phns, phn_lens)[source]
Expands each phoneme in the phn sequence by the number of hidden states per phoneme defined in the HMM.
- Parameters:
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known/thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.
- Returns:
expanded_phns
- Return type:
torch.Tensor (batch, phoneme in expanded phn sequence)
Example
>>> phns = torch.tensor([[0., 3., 5., 0.],
...                      [0., 2., 0., 0.]])
>>> phn_lens = torch.tensor([1., 0.75])
>>> aligner = HMMAligner(states_per_phoneme = 3)
>>> expanded_phns = aligner.expand_phns_by_states_per_phoneme(
...     phns, phn_lens
... )
>>> expanded_phns
tensor([[ 0.,  1.,  2.,  9., 10., 11., 15., 16., 17.,  0.,  1.,  2.],
        [ 0.,  1.,  2.,  6.,  7.,  8.,  0.,  1.,  2.,  0.,  0.,  0.]])
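The expansion in the example is consistent with mapping phoneme index p to the consecutive state indices p*S, ..., p*S + S - 1, where S = states_per_phoneme. A minimal sketch of that index arithmetic for a single unpadded sequence (a hypothetical helper, not the SpeechBrain implementation):
def expand_one_sequence(phn_seq, states_per_phoneme):
    # Hypothetical helper: phoneme p becomes states p*S, ..., p*S + S - 1.
    S = states_per_phoneme
    expanded = []
    for p in phn_seq:
        base = int(p) * S
        expanded.extend(range(base, base + S))
    return expanded

expand_one_sequence([0, 3, 5, 0], 3)   # [0, 1, 2, 9, 10, 11, 15, 16, 17, 0, 1, 2]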
- store_alignments(ids, alignments)[source]
Records Viterbi alignments in self.align_dict.
- Parameters:
ids (list of str) – The utterance IDs for the files in the batch.
alignments (list of lists of int) – The Viterbi alignments to record, one per utterance.
Example
>>> aligner = HMMAligner()
>>> ids = ['id1', 'id2']
>>> alignments = [[0, 2, 4], [1, 2, 3, 4]]
>>> aligner.store_alignments(ids, alignments)
>>> aligner.align_dict.keys()
dict_keys(['id1', 'id2'])
>>> aligner.align_dict['id1']
tensor([0, 2, 4], dtype=torch.int16)
- get_prev_alignments(ids, emission_pred, lens, phns, phn_lens)[source]
Fetches previously recorded Viterbi alignments if they are available. If not, fetches flat start alignments. Currently, this assumes that if a Viterbi alignment is not available for the first utterance in the batch, it will not be available for the remaining utterances.
- Parameters:
ids (list of str) – The utterance IDs used to look up previously stored alignments.
emission_pred (torch.Tensor (batch, time, phoneme in vocabulary)) – Posterior probabilities from our acoustic model. Used to infer the duration of the longest utterance in the batch.
lens (torch.Tensor (batch)) – The relative duration of each utterance sound file.
phns (torch.Tensor (batch, phoneme in phn sequence)) – The phonemes that are known/thought to be in each utterance.
phn_lens (torch.Tensor (batch)) – The relative length of each phoneme sequence in the batch.
- Returns:
Zero-padded alignments.
- Return type:
torch.Tensor (batch, time)
Example
>>> ids = ['id1', 'id2']
>>> emission_pred = torch.tensor([[[ -1., -10., -10.],
...                                [-10.,  -1., -10.],
...                                [-10., -10.,  -1.]],
...
...                               [[ -1., -10., -10.],
...                                [-10.,  -1., -10.],
...                                [-10., -10., -10.]]])
>>> lens = torch.tensor([1., 0.66])
>>> phns = torch.tensor([[0, 1, 2],
...                      [0, 1, 0]])
>>> phn_lens = torch.tensor([1., 0.66])
>>> aligner = HMMAligner()
>>> alignment_batch = aligner.get_prev_alignments(
...     ids, emission_pred, lens, phns, phn_lens
... )
>>> alignment_batch
tensor([[0, 1, 2],
        [0, 1, 0]])
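Together with store_alignments, this supports a simple Viterbi-training loop. A hedged sketch of one iteration, reusing the toy tensors from the example above (the acoustic-model update itself is elided and is not a documented API):
targets = aligner.get_prev_alignments(ids, emission_pred, lens, phns, phn_lens)
# ... update the acoustic model to predict `targets` ...
viterbi_scores, alignments = aligner(emission_pred, lens, phns, phn_lens, 'viterbi')
aligner.store_alignments(ids, alignments)   # fetched as targets on the next pass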
- calc_accuracy(alignments, ends, phns, ind2labs=None)[source]
Calculates mean accuracy between predicted alignments and ground truth alignments. Ground truth alignments are derived from ground truth phns and their ends in the audio sample.
- Parameters:
alignments (list of lists of ints/floats) – The predicted alignments for each utterance in the batch.
ends (list of lists of ints) – A list of lists of sample indices where each ground truth phoneme ends, according to the transcription. Note: current implementation assumes that ‘ends’ mark the index where the next phoneme begins.
phns (list of lists of ints/floats) – The unpadded list of lists of ground truth phonemes in the batch.
ind2labs (tuple) – (Optional) Contains the original index-to-label dicts for the first and second sequence of phonemes.
- Returns:
mean_acc – The mean percentage of times that the upsampled predicted alignment matches the ground truth alignment.
- Return type:
torch.Tensor (scalar)
Example
>>> aligner = HMMAligner()
>>> alignments = [[0., 0., 0., 1.]]
>>> phns = [[0., 1.]]
>>> ends = [[2, 4]]
>>> mean_acc = aligner.calc_accuracy(alignments, ends, phns)
>>> mean_acc.item()
75.0
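In the example, ends = [2, 4] expands to the ground-truth frame labels [0, 0, 1, 1]; the prediction [0, 0, 0, 1] matches 3 of 4 frames, hence 75.0. A minimal sketch of that per-utterance comparison (ignoring the upsampling mentioned above; a hypothetical helper, not the actual implementation):
def frame_accuracy(alignment, ends, phns):
    # Hypothetical helper: phoneme i covers frames [previous end, ends[i]),
    # since each 'end' marks where the next phoneme begins.
    truth, start = [], 0
    for phn, end in zip(phns, ends):
        truth.extend([phn] * (end - start))
        start = end
    matches = sum(p == t for p, t in zip(alignment, truth))
    return 100.0 * matches / len(truth)

frame_accuracy([0., 0., 0., 1.], [2, 4], [0., 1.])   # 75.0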
- collapse_alignments(alignments)[source]
Converts alignments to a 1-state-per-phoneme style.
- Parameters:
alignments (list of ints) – Predicted alignments for a single utterance.
- Returns:
sequence – The predicted alignments converted to a 1-state-per-phoneme style.
- Return type:
list of ints
Example
>>> aligner = HMMAligner(states_per_phoneme = 3)
>>> alignments = [0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 2]
>>> sequence = aligner.collapse_alignments(alignments)
>>> sequence
[0, 1, 1, 0]
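The example is consistent with emitting one phoneme index, state // S, each time a phoneme's final sub-state (state % S == S - 1) is reached. A hedged sketch under that assumption (the actual implementation may handle repeated final sub-states differently):
def collapse_sketch(alignments, states_per_phoneme):
    # Hypothetical helper: emit state // S at each final sub-state.
    S = states_per_phoneme
    return [s // S for s in alignments if s % S == S - 1]

collapse_sketch([0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 2], 3)   # [0, 1, 1, 0]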
- speechbrain.alignment.aligner.map_inds_to_intersect(lists1, lists2, ind2labs)[source]
Converts 2 lists containing indices for phonemes from different phoneme sets to a single phoneme set, so that comparing the equality of the indices of the resulting lists will yield the correct accuracy.
- Parameters:
lists1 (list of lists of ints) – Contains the indices of the first sequence of phonemes.
lists2 (list of lists of ints) – Contains the indices of the second sequence of phonemes.
ind2labs (tuple (dict, dict)) – Contains the original index-to-label dicts for the first and second sequence of phonemes.
- Returns:
lists1_new (list of lists of ints) – Contains the indices of the first sequence of phonemes, mapped to the new phoneme set.
lists2_new (list of lists of ints) – Contains the indices of the second sequence of phonemes, mapped to the new phoneme set.
Example
>>> lists1 = [[0, 1]]
>>> lists2 = [[0, 1]]
>>> ind2lab1 = {
...     0: "a",
...     1: "b",
... }
>>> ind2lab2 = {
...     0: "a",
...     1: "c",
... }
>>> ind2labs = (ind2lab1, ind2lab2)
>>> out1, out2 = map_inds_to_intersect(lists1, lists2, ind2labs)
>>> out1
[[0, 1]]
>>> out2
[[0, 2]]
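The example output is consistent with building a single label-to-index map over the union of both label sets and re-indexing both lists through it. A minimal sketch under that reading (hypothetical, not the actual implementation):
def map_inds_sketch(lists1, lists2, ind2labs):
    ind2lab1, ind2lab2 = ind2labs
    # One joint map over the union of the label sets, sorted for stable indices.
    all_labs = sorted(set(ind2lab1.values()) | set(ind2lab2.values()))
    lab2ind = {lab: i for i, lab in enumerate(all_labs)}
    lists1_new = [[lab2ind[ind2lab1[i]] for i in seq] for seq in lists1]
    lists2_new = [[lab2ind[ind2lab2[i]] for i in seq] for seq in lists2]
    return lists1_new, lists2_new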
- speechbrain.alignment.aligner.batch_log_matvecmul(A, b)[source]
For each ‘matrix’ and ‘vector’ pair in the batch, do matrix-vector multiplication in the log domain, i.e., logsumexp instead of add, add instead of multiply.
- Parameters:
A (torch.Tensor (batch, dim1, dim2)) – Tensor.
b (torch.Tensor (batch, dim1)) – Tensor.
- Returns:
x
- Return type:
torch.Tensor (batch, dim1)
Example
>>> A = torch.tensor([[[   0., 0.],
...                    [ -1e5, 0.]]])
>>> b = torch.tensor([[0., 0.,]])
>>> x = batch_log_matvecmul(A, b)
>>> x
tensor([[0.6931, 0.0000]])
>>>
>>> # non-log domain equivalent without batching functionality
>>> A_ = torch.tensor([[1., 1.],
...                    [0., 1.]])
>>> b_ = torch.tensor([1., 1.,])
>>> x_ = torch.matmul(A_, b_)
>>> x_
tensor([2., 1.])
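The operation amounts to x[k, i] = logsumexp_j(A[k, i, j] + b[k, j]). A one-line broadcasted sketch that reproduces the example above, assuming dim1 == dim2 as with square transition matrices (not necessarily the library's implementation):
import torch

def log_matvecmul_sketch(A, b):
    # x[k, i] = logsumexp over j of (A[k, i, j] + b[k, j])
    return torch.logsumexp(A + b.unsqueeze(1), dim=2)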
- speechbrain.alignment.aligner.batch_log_maxvecmul(A, b)[source]
Similar to batch_log_matvecmul, but takes a maximum instead of logsumexp. Returns both the max and the argmax.
- Parameters:
A (torch.Tensor (batch, dim1, dim2)) – Tensor.
b (torch.Tensor (batch, dim1)) – Tensor.
- Returns:
x (torch.Tensor (batch, dim1)) – Tensor.
argmax (torch.Tensor (batch, dim1)) – Tensor.
Example
>>> A = torch.tensor([[[   0., -1.],
...                    [ -1e5,  0.]]])
>>> b = torch.tensor([[0., 0.,]])
>>> x, argmax = batch_log_maxvecmul(A, b)
>>> x
tensor([[0., 0.]])
>>> argmax
tensor([[0, 1]])
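Under the same reading as the sketch for batch_log_matvecmul, the max variant swaps logsumexp for torch.max, whose returned indices serve as the backpointers needed for Viterbi decoding; a hedged sketch:
import torch

def log_maxvecmul_sketch(A, b):
    # x[k, i], argmax[k, i] = max / argmax over j of (A[k, i, j] + b[k, j])
    x, argmax = torch.max(A + b.unsqueeze(1), dim=2)
    return x, argmax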