speechbrain.nnet.losses module
Losses for training neural networks.
- Authors
Mirco Ravanelli 2020
Samuele Cornell 2020
Hwidong Na 2020
Yan Gao 2020
Titouan Parcollet 2020
Summary
Classes:
AdditiveAngularMargin – An implementation of Additive Angular Margin (AAM) proposed in the following paper: "Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition" (https://arxiv.org/abs/1906.07317)
AngularMargin – An implementation of Angular Margin (AM) proposed in the following paper: "Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition" (https://arxiv.org/abs/1906.07317)
ContrastiveLoss – Contrastive loss as used in wav2vec2.
LogSoftmaxWrapper – Wraps a margin-based loss with a log-softmax and NLL loss (see the Reference section below).
PitWrapper – Permutation Invariant Wrapper to allow Permutation Invariant Training (PIT) with existing losses.
Functions:
bce_loss – Computes binary cross-entropy (BCE) loss.
cal_si_snr – Calculate SI-SNR.
cal_snr – Calculate binaural channel SNR.
ce_kd – Simple version of distillation for cross-entropy loss.
classification_error – Computes the classification error at frame or batch level.
compute_masked_loss – Compute the true average loss of a set of waveforms of unequal length.
ctc_loss – CTC loss.
ctc_loss_kd – Knowledge distillation for CTC loss.
get_mask – Compute a binary mask from source lengths (see the Reference section below).
get_si_snr_with_pitwrapper – This function wraps si_snr calculation with the speechbrain pit-wrapper.
get_snr_with_pitwrapper – This function wraps snr calculation with the speechbrain pit-wrapper.
kldiv_loss – Computes the KL-divergence error at the batch level.
l1_loss – Compute the true l1 loss, accounting for length differences.
mse_loss – Compute the true mean squared error, accounting for length differences.
nll_loss – Computes negative log likelihood loss.
nll_loss_kd – Knowledge distillation for negative log-likelihood loss.
transducer_loss – Transducer loss, see speechbrain/nnet/loss/transducer_loss.py.
truncate – Ensure that predictions and targets are the same length.
Reference
- speechbrain.nnet.losses.transducer_loss(logits, targets, input_lens, target_lens, blank_index, reduction='mean', use_torchaudio=True)[source]
Transducer loss, see speechbrain/nnet/loss/transducer_loss.py.
- Parameters
logits (torch.Tensor) – Predicted tensor, of shape [batch, maxT, maxU, num_labels].
targets (torch.Tensor) – Target tensor, without any blanks, of shape [batch, target_len].
input_lens (torch.Tensor) – Length of each utterance.
target_lens (torch.Tensor) – Length of each target sequence.
blank_index (int) – The location of the blank symbol among the label indices.
reduction (str) – Specifies the reduction to apply to the output: ‘mean’ | ‘batchmean’ | ‘sum’.
use_torchaudio (bool) – If True, use Transducer loss implementation from torchaudio, otherwise, use Speechbrain Numba implementation.
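A minimal usage sketch, assuming (as for the other losses in this module) that input_lens and target_lens are relative lengths in [0, 1], and that maxU equals the target length plus one for the blank-augmented output grid:
import torch
from speechbrain.nnet.losses import transducer_loss

batch, maxT, target_len, num_labels = 2, 8, 3, 5
# Joint-network outputs over the [batch, maxT, maxU, num_labels] grid,
# where maxU = target_len + 1 accounts for the blank transitions.
logits = torch.randn(batch, maxT, target_len + 1, num_labels, requires_grad=True)
targets = torch.randint(1, num_labels, (batch, target_len))  # no blanks (index 0)
input_lens = torch.ones(batch)   # relative lengths: full-length utterances
target_lens = torch.ones(batch)  # relative lengths: full-length targets
loss = transducer_loss(logits, targets, input_lens, target_lens, blank_index=0)
loss.backward()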
- class speechbrain.nnet.losses.PitWrapper(base_loss)[source]
Bases:
Module
Permutation Invariant Wrapper to allow Permutation Invariant Training (PIT) with existing losses.
Permutation invariance is calculated over the sources/classes axis which is assumed to be the rightmost dimension: predictions and targets tensors are assumed to have shape [batch, …, channels, sources].
- Parameters
base_loss (function) – Base loss function, e.g. torch.nn.MSELoss. It is assumed that it takes two arguments: predictions and targets and no reduction is performed. (if a pytorch loss is used, the user must specify reduction=”none”).
- Returns
pit_loss – Torch module supporting forward method for PIT.
- Return type
torch.nn.Module
Example
>>> pit_mse = PitWrapper(nn.MSELoss(reduction="none"))
>>> targets = torch.rand((2, 32, 4))
>>> p = (3, 0, 2, 1)
>>> predictions = targets[..., p]
>>> loss, opt_p = pit_mse(predictions, targets)
>>> loss
tensor([0., 0.])
- reorder_tensor(tensor, p)[source]
- Parameters
tensor (torch.Tensor) – Tensor to reorder given the optimal permutation, of shape [batch, …, sources].
p (list of tuples) – List of optimal permutations, e.g. for batch=2 and n_sources=3: [(0, 1, 2), (0, 2, 1)].
- Returns
reordered – Reordered tensor given permutation p.
- Return type
torch.Tensor
- forward(preds, targets)[source]
- Parameters
preds (torch.Tensor) – Network predictions tensor, of shape [batch, channels, …, sources].
targets (torch.Tensor) – Target tensor, of shape [batch, channels, …, sources].
- Returns
loss (torch.Tensor) – Permutation invariant loss for current examples, tensor of shape [batch]
perms (list) – List of indexes for optimal permutation of the inputs over sources. e.g., [(0, 1, 2), (2, 1, 0)] for three sources and 2 examples per batch.
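A short sketch showing how the permutations returned by forward can be fed back into reorder_tensor to align the predictions with the targets (since the loss here is exactly zero, the reordered predictions should match the targets):
import torch
import torch.nn as nn
from speechbrain.nnet.losses import PitWrapper

pit_mse = PitWrapper(nn.MSELoss(reduction="none"))
targets = torch.rand((2, 32, 4))
predictions = targets[..., (3, 0, 2, 1)]  # scramble the sources axis
loss, perms = pit_mse(predictions, targets)
# Undo the scrambling with the optimal permutations found by forward
aligned = pit_mse.reorder_tensor(predictions, perms)
assert torch.allclose(aligned, targets)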
- speechbrain.nnet.losses.ctc_loss(log_probs, targets, input_lens, target_lens, blank_index, reduction='mean')[source]
CTC loss.
- Parameters
log_probs (torch.Tensor) – Predicted tensor, of shape [batch, time, chars].
targets (torch.Tensor) – Target tensor, without any blanks, of shape [batch, target_len]
input_lens (torch.Tensor) – Length of each utterance.
target_lens (torch.Tensor) – Length of each target sequence.
blank_index (int) – The location of the blank symbol among the character indexes.
reduction (str) – What reduction to apply to the output. ‘mean’, ‘sum’, ‘batch’, ‘batchmean’, ‘none’. See pytorch for ‘mean’, ‘sum’, ‘none’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
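A minimal usage sketch, assuming relative lengths in [0, 1] as elsewhere in this module:
import torch
from speechbrain.nnet.losses import ctc_loss

batch, time, chars = 2, 10, 5
log_probs = torch.randn(batch, time, chars).log_softmax(dim=-1)
targets = torch.randint(1, chars, (batch, 4))  # blank (index 0) never appears
input_lens = torch.ones(batch)   # relative lengths: full-length utterances
target_lens = torch.ones(batch)
loss = ctc_loss(log_probs, targets, input_lens, target_lens, blank_index=0)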
- speechbrain.nnet.losses.l1_loss(predictions, targets, length=None, allowed_len_diff=3, reduction='mean')[source]
Compute the true l1 loss, accounting for length differences.
- Parameters
predictions (torch.Tensor) – Predicted tensor, of shape [batch, time, *].
targets (torch.Tensor) – Target tensor with the same size as the predicted tensor.
length (torch.Tensor) – Length of each utterance for computing true error with a mask.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
Example
>>> probs = torch.tensor([[0.9, 0.1, 0.1, 0.9]])
>>> l1_loss(probs, torch.tensor([[1., 0., 0., 1.]]))
tensor(0.1000)
- speechbrain.nnet.losses.mse_loss(predictions, targets, length=None, allowed_len_diff=3, reduction='mean')[source]
Compute the true mean squared error, accounting for length differences.
- Parameters
predictions (torch.Tensor) – Predicted tensor, of shape [batch, time, *].
targets (torch.Tensor) – Target tensor with the same size as the predicted tensor.
length (torch.Tensor) – Length of each utterance for computing true error with a mask.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
Example
>>> probs = torch.tensor([[0.9, 0.1, 0.1, 0.9]])
>>> mse_loss(probs, torch.tensor([[1., 0., 0., 1.]]))
tensor(0.0100)
- speechbrain.nnet.losses.classification_error(probabilities, targets, length=None, allowed_len_diff=3, reduction='mean')[source]
Computes the classification error at frame or batch level.
- Parameters
probabilities (torch.Tensor) – The posterior probabilities of shape [batch, prob] or [batch, frames, prob]
targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames]
length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
Example
>>> probs = torch.tensor([[[0.9, 0.1], [0.1, 0.9]]])
>>> classification_error(probs, torch.tensor([1, 1]))
tensor(0.5000)
- speechbrain.nnet.losses.nll_loss(log_probabilities, targets, length=None, label_smoothing=0.0, allowed_len_diff=3, reduction='mean')[source]
Computes negative log likelihood loss.
- Parameters
log_probabilities (torch.Tensor) – The probabilities after log has been applied. Format is [batch, log_p] or [batch, frames, log_p].
targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames].
length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
Example
>>> probs = torch.tensor([[0.9, 0.1], [0.1, 0.9]])
>>> nll_loss(torch.log(probs), torch.tensor([1, 1]))
tensor(1.2040)
- speechbrain.nnet.losses.bce_loss(inputs, targets, length=None, weight=None, pos_weight=None, reduction='mean', allowed_len_diff=3, label_smoothing=0.0)[source]
Computes binary cross-entropy (BCE) loss. It also applies the sigmoid function directly (this improves the numerical stability).
- Parameters
inputs (torch.Tensor) – The output before applying the final sigmoid. Format is [batch] or [batch, frames], with or without a trailing singleton dimension.
targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames].
length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.
weight (torch.Tensor) – A manual rescaling weight; if provided, it is repeated to match the shape of the input tensor.
pos_weight (torch.Tensor) – A weight of positive examples. Must be a vector with length equal to the number of classes.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
Example
>>> inputs = torch.tensor([10.0, -6.0])
>>> targets = torch.tensor([1, 0])
>>> bce_loss(inputs, targets)
tensor(0.0013)
- speechbrain.nnet.losses.kldiv_loss(log_probabilities, targets, length=None, label_smoothing=0.0, allowed_len_diff=3, pad_idx=0, reduction='mean')[source]
Computes the KL-divergence error at the batch level. This loss applies label smoothing directly to the targets.
- Parameters
log_probabilities (torch.Tensor) – The posterior log-probabilities, of shape [batch, prob] or [batch, frames, prob].
targets (torch.Tensor) – The targets, of shape [batch] or [batch, frames].
length (torch.Tensor) – Length of each utterance, if frame-level loss is desired.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
reduction (str) – Options are ‘mean’, ‘batch’, ‘batchmean’, ‘sum’. See pytorch for ‘mean’, ‘sum’. The ‘batch’ option returns one loss per item in the batch, ‘batchmean’ returns sum / batch size.
Example
>>> probs = torch.tensor([[0.9, 0.1], [0.1, 0.9]])
>>> kldiv_loss(torch.log(probs), torch.tensor([1, 1]))
tensor(1.2040)
- speechbrain.nnet.losses.truncate(predictions, targets, allowed_len_diff=3)[source]
Ensure that predictions and targets are the same length.
- Parameters
predictions (torch.Tensor) – First tensor for checking length.
targets (torch.Tensor) – Second tensor for checking length.
allowed_len_diff (int) – Length difference that will be tolerated before raising an exception.
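A small usage sketch: both tensors are returned, cut to the shorter time length when the difference lies within allowed_len_diff:
import torch
from speechbrain.nnet.losses import truncate

predictions = torch.rand(2, 12, 5)
targets = torch.rand(2, 10, 5)
# Both now have time length 10; a difference above 3 would raise an exception.
predictions, targets = truncate(predictions, targets, allowed_len_diff=3)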
- speechbrain.nnet.losses.compute_masked_loss(loss_fn, predictions, targets, length=None, label_smoothing=0.0, reduction='mean')[source]
Compute the true average loss of a set of waveforms of unequal length.
- Parameters
loss_fn (function) – A function for computing the loss taking just predictions and targets. Should return all the losses, not a reduction (e.g. reduction=”none”).
predictions (torch.Tensor) – First argument to loss function.
targets (torch.Tensor) – Second argument to loss function.
length (torch.Tensor) – Length of each utterance to compute mask. If None, global average is computed and returned.
label_smoothing (float) – The proportion of label smoothing. Should only be used for NLL loss. Ref: Regularizing Neural Networks by Penalizing Confident Output Distributions. https://arxiv.org/abs/1701.06548
reduction (str) – One of ‘mean’, ‘batch’, ‘batchmean’, ‘none’: ‘mean’ returns a single value, ‘batch’ returns one value per item in the batch, ‘batchmean’ returns sum / batch_size, and ‘none’ returns all losses.
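A minimal sketch, assuming relative lengths in [0, 1]; the unreduced PyTorch loss is wrapped in a lambda so that no reduction is applied by loss_fn itself:
import torch
import torch.nn.functional as F
from speechbrain.nnet.losses import compute_masked_loss

predictions = torch.rand(2, 10, 5)
targets = torch.rand(2, 10, 5)
length = torch.tensor([1.0, 0.6])  # second item is padded beyond 60% of its frames
loss = compute_masked_loss(
    lambda p, t: F.l1_loss(p, t, reduction="none"),
    predictions, targets, length=length,
)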
- speechbrain.nnet.losses.get_si_snr_with_pitwrapper(source, estimate_source)[source]
This function wraps si_snr calculation with the speechbrain pit-wrapper.
- Parameters
source (torch.Tensor) – Source tensor of shape [B, T, C], where B is the batch size, T is the length of the sources, and C is the number of sources. The ordering makes this loss compatible with the class PitWrapper.
estimate_source (torch.Tensor) – The estimated source, of shape [B, T, C].
Example
>>> x = torch.arange(600).reshape(3, 100, 2)
>>> xhat = x[:, :, (1, 0)]
>>> si_snr = -get_si_snr_with_pitwrapper(x, xhat)
>>> print(si_snr)
tensor([135.2284, 135.2284, 135.2284])
- speechbrain.nnet.losses.get_snr_with_pitwrapper(source, estimate_source)[source]
This function wraps snr calculation with the speechbrain pit-wrapper.
- Parameters
source (torch.Tensor) – Source tensor of shape [B, T, E, C], where B is the batch size, T is the length of the sources, E is the number of binaural channels, and C is the number of sources. The ordering makes this loss compatible with the class PitWrapper.
estimate_source (torch.Tensor) – The estimated source, of shape [B, T, E, C].
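A shape-only usage sketch with random data; the estimate is a noisy copy of the source so that the SNR stays finite:
import torch
from speechbrain.nnet.losses import get_snr_with_pitwrapper

B, T, E, C = 2, 100, 2, 3  # batch, time, binaural channels, sources
source = torch.randn(B, T, E, C)
estimate_source = source + 0.01 * torch.randn(B, T, E, C)
# One loss value per batch item (negated SNR, mirroring get_si_snr_with_pitwrapper)
loss = get_snr_with_pitwrapper(source, estimate_source)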
- speechbrain.nnet.losses.cal_si_snr(source, estimate_source)[source]
Calculate SI-SNR.
- Parameters
source (torch.Tensor) – Source tensor of shape [T, B, C], where B is the batch size, T is the length of the sources, and C is the number of sources. The ordering makes this loss compatible with the class PitWrapper.
estimate_source (torch.Tensor) – The estimated source, of shape [T, B, C].
Example
>>> import numpy as np
>>> x = torch.Tensor([[1, 0], [123, 45], [34, 5], [2312, 421]])
>>> xhat = x[:, (1, 0)]
>>> x = x.unsqueeze(-1).repeat(1, 1, 2)
>>> xhat = xhat.unsqueeze(1).repeat(1, 2, 1)
>>> si_snr = -cal_si_snr(x, xhat)
>>> print(si_snr)
tensor([[[ 25.2142, 144.1789],
         [130.9283,  25.2142]]])
- speechbrain.nnet.losses.cal_snr(source, estimate_source)[source]
Calculate binaural channel SNR.
- Parameters
source (torch.Tensor) – Source tensor of shape [T, E, B, C], where B is the batch size, T is the length of the sources, E is the number of binaural channels, and C is the number of sources. The ordering makes this loss compatible with the class PitWrapper.
estimate_source (torch.Tensor) – The estimated source, of shape [T, E, B, C].
- speechbrain.nnet.losses.get_mask(source, source_lengths)[source]
- Parameters
source ([T, B, C]) –
source_lengths ([B]) –
- Returns
mask ([T, B, 1])
Example
>>> source = torch.randn(4, 3, 2)
>>> source_lengths = torch.Tensor([2, 1, 4]).int()
>>> mask = get_mask(source, source_lengths)
>>> print(mask)
tensor([[[1.],
         [1.],
         [1.]],
<BLANKLINE>
        [[1.],
         [0.],
         [1.]],
<BLANKLINE>
        [[0.],
         [0.],
         [1.]],
<BLANKLINE>
        [[0.],
         [0.],
         [1.]]])
- class speechbrain.nnet.losses.AngularMargin(margin=0.0, scale=1.0)[source]
Bases:
Module
An implementation of Angular Margin (AM) proposed in the following paper: "Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition" (https://arxiv.org/abs/1906.07317)
- Parameters
margin (float) – The margin for cosine similarity.
scale (float) – The scale for cosine similarity.
- Returns
predictions
- Return type
torch.Tensor
Example
>>> pred = AngularMargin()
>>> outputs = torch.tensor([[1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9]])
>>> targets = torch.tensor([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
>>> predictions = pred(outputs, targets)
>>> predictions[:,0] > predictions[:,1]
tensor([ True, False, True, False])
- forward(outputs, targets)[source]
Compute AM between two tensors.
- Parameters
outputs (torch.Tensor) – The outputs of shape [N, C]; cosine similarity is required.
targets (torch.Tensor) – The targets of shape [N, C], to which the margin is applied.
- Returns
predictions
- Return type
torch.Tensor
- class speechbrain.nnet.losses.AdditiveAngularMargin(margin=0.0, scale=1.0, easy_margin=False)[source]
Bases:
AngularMargin
An implementation of Additive Angular Margin (AAM) proposed in the following paper: "Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition" (https://arxiv.org/abs/1906.07317)
- Parameters
margin (float) – The margin for cosine similarity.
scale (float) – The scale for cosine similarity.
easy_margin (bool) – If True, the margin is applied only where the cosine similarity is positive.
- Returns
predictions – Tensor.
- Return type
torch.Tensor
Example
>>> outputs = torch.tensor([[1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9]])
>>> targets = torch.tensor([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
>>> pred = AdditiveAngularMargin()
>>> predictions = pred(outputs, targets)
>>> predictions[:,0] > predictions[:,1]
tensor([ True, False, True, False])
- forward(outputs, targets)[source]
Compute AAM between two tensors.
- Parameters
outputs (torch.Tensor) – The outputs of shape [N, C]; cosine similarity is required.
targets (torch.Tensor) – The targets of shape [N, C], to which the margin is applied.
- Returns
predictions
- Return type
torch.Tensor
- class speechbrain.nnet.losses.LogSoftmaxWrapper(loss_fn)[source]
Bases:
Module
- Returns
loss (torch.Tensor) – Learning loss
predictions (torch.Tensor) – Log probabilities
Example
>>> outputs = torch.tensor([[1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9]])
>>> outputs = outputs.unsqueeze(1)
>>> targets = torch.tensor([[0], [1], [0], [1]])
>>> log_prob = LogSoftmaxWrapper(nn.Identity())
>>> loss = log_prob(outputs, targets)
>>> 0 <= loss < 1
tensor(True)
>>> log_prob = LogSoftmaxWrapper(AngularMargin(margin=0.2, scale=32))
>>> loss = log_prob(outputs, targets)
>>> 0 <= loss < 1
tensor(True)
>>> outputs = torch.tensor([[1., -1.], [-1., 1.], [0.9, 0.1], [0.1, 0.9]])
>>> log_prob = LogSoftmaxWrapper(AdditiveAngularMargin(margin=0.3, scale=32))
>>> loss = log_prob(outputs, targets)
>>> 0 <= loss < 1
tensor(True)
- forward(outputs, targets, length=None)[source]
- Parameters
outputs (torch.Tensor) – Network output tensor, of shape [batch, 1, outdim].
targets (torch.Tensor) – Target tensor, of shape [batch, 1].
- Returns
loss – Loss for current examples.
- Return type
torch.Tensor
- speechbrain.nnet.losses.ctc_loss_kd(log_probs, targets, input_lens, blank_index, device)[source]
Knowledge distillation for CTC loss.
Reference
Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. https://arxiv.org/abs/2005.09310
- Parameters
log_probs (torch.Tensor) – Predicted tensor from the student model, of shape [batch, time, chars].
targets (torch.Tensor) – Predicted tensor from a single teacher model, of shape [batch, time, chars].
input_lens (torch.Tensor) – Length of each utterance.
blank_index (int) – The location of the blank symbol among the character indexes.
device (str) – Device for computing.
- speechbrain.nnet.losses.ce_kd(inp, target)[source]
Simple version of distillation for cross-entropy loss.
- Parameters
inp (torch.Tensor) – The probabilities from student model, of shape [batch_size * length, feature]
target (torch.Tensor) – The probabilities from teacher model, of shape [batch_size * length, feature]
- speechbrain.nnet.losses.nll_loss_kd(probabilities, targets, rel_lab_lengths)[source]
Knowledge distillation for negative log-likelihood loss.
Reference
Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. https://arxiv.org/abs/2005.09310
- Parameters
probabilities (torch.Tensor) – The predicted probabilities from the student model. Format is [batch, frames, p].
targets (torch.Tensor) – The target probabilities from the teacher model. Format is [batch, frames, p].
rel_lab_lengths (torch.Tensor) – Length of each utterance, if the frame-level loss is desired.
Example
>>> probabilities = torch.tensor([[[0.8, 0.2], [0.2, 0.8]]])
>>> targets = torch.tensor([[[0.9, 0.1], [0.1, 0.9]]])
>>> rel_lab_lengths = torch.tensor([1.])
>>> nll_loss_kd(probabilities, targets, rel_lab_lengths)
tensor(-0.7400)
- class speechbrain.nnet.losses.ContrastiveLoss(logit_temp)[source]
Bases:
Module
Contrastive loss as used in wav2vec2.
Reference
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations https://arxiv.org/abs/2006.11477
- Parameters
logit_temp (float) – A temperature by which to divide the logits.
- forward(x, y, negs)[source]
- Parameters
x (torch.Tensor) – Encoded embeddings with shape (B, T, C).
y (torch.Tensor) – Feature extractor target embeddings with shape (B, T, C).
negs (torch.Tensor) – Negative embeddings from the feature extractor, with shape (N, B, T, C), where N is the number of negatives. Can be obtained with our sample_negatives function (check in lobes/wav2vec2).
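A shape-only usage sketch with random tensors, illustrating the documented argument shapes (the exact return value of forward is not specified above):
import torch
from speechbrain.nnet.losses import ContrastiveLoss

B, T, C, N = 2, 10, 16, 4  # batch, time, feature dim, number of negatives
contrastive = ContrastiveLoss(logit_temp=0.1)
x = torch.randn(B, T, C)        # encoded embeddings
y = torch.randn(B, T, C)        # feature-extractor target embeddings
negs = torch.randn(N, B, T, C)  # negative embeddings
out = contrastive(x, y, negs)   # see the forward parameters above for argument roles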