speechbrain.lobes.models.L2I module

This file implements the classes and functions needed for the Listen-to-Interpret (L2I) interpretation method from https://arxiv.org/abs/2202.11479v2

Authors
  • Cem Subakan 2022
  • Francesco Paissan 2022

Summary

Classes:

CNN14PSI_stft

This class estimates a saliency map on the STFT domain, given classifier representations.

CNN14PSI_stft_2d

This class estimates the NMF activations to create a saliency map using the L2I framework

NMFDecoderAudio

This class implements an NMF decoder

NMFEncoder

This class implements an NMF encoder with a convolutional network

Psi

Convolutional Layers to estimate NMF Activations from Classifier Representations

PsiOptimized

Convolutional Layers to estimate NMF Activations from Classifier Representations, optimized for log-spectra.

Theta

This class implements a linear classifier on top of NMF activations

Functions:

weights_init

Applies Xavier initialization to network weights.

Reference

class speechbrain.lobes.models.L2I.Psi(n_comp=100, T=431, in_emb_dims=[2048, 1024, 512])[source]

Bases: Module

Convolutional Layers to estimate NMF Activations from Classifier Representations

Parameters:
  • n_comp (int) – Number of NMF components (or equivalently number of neurons at the output per timestep)

  • T (int) – The targeted length along the time dimension

  • in_emb_dims (List with int elements) – A list of length 3 that contains the dimensionality of each input. The list needs to match the number of channels in the input classifier representations. The last entry should be the smallest entry.

Example

>>> inp = [torch.ones(2, 150, 6, 2), torch.ones(2, 100, 6, 2), torch.ones(2, 50, 12, 5)]
>>> psi = Psi(n_comp=100, T=120, in_emb_dims=[150, 100, 50])
>>> h = psi(inp)
>>> print(h.shape)
torch.Size([2, 100, 120])

forward(inp)[source]

This forward function returns the NMF time activations given classifier activations

Parameters:

inp (list) – A length 3 list of classifier input representations.

Return type:

NMF time activations
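
The shape contract described above (three classifier representations in, a B x n_comp x T activation tensor out) can be illustrated with a much simpler stand-in. The sketch below is not the library's Psi network: the class name `PsiSketch`, the pooling over the last axis, the linear interpolation to length T, and the 1x1 convolution are all assumptions chosen only to reproduce the input and output shapes from the example.

```python
import torch
import torch.nn.functional as F
from torch import nn


class PsiSketch(nn.Module):
    """Hypothetical stand-in that only mirrors the documented shapes of Psi."""

    def __init__(self, n_comp, T, in_emb_dims):
        super().__init__()
        self.T = T
        # Assumption: a 1x1 convolution mixes all input channels into n_comp.
        self.mix = nn.Conv1d(sum(in_emb_dims), n_comp, kernel_size=1)

    def forward(self, inp):
        resized = []
        for rep in inp:
            # Assumption: average out the last axis, giving B x C x L.
            seq = rep.mean(dim=-1)
            # Stretch every representation to the targeted length T.
            seq = F.interpolate(seq, size=self.T, mode="linear", align_corners=False)
            resized.append(seq)
        # Concatenate along channels; the ReLU keeps activations nonnegative.
        return torch.relu(self.mix(torch.cat(resized, dim=1)))


inp = [torch.ones(2, 150, 6, 2), torch.ones(2, 100, 6, 2), torch.ones(2, 50, 12, 5)]
psi_sketch = PsiSketch(n_comp=100, T=120, in_emb_dims=[150, 100, 50])
h_sketch = psi_sketch(inp)
```

As in the documented example, the three inputs may have different spatial sizes; only their channel counts have to match `in_emb_dims`.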

class speechbrain.lobes.models.L2I.NMFDecoderAudio(n_comp=100, n_freq=513, device='cuda')[source]

Bases: Module

This class implements an NMF decoder

Parameters:
  • n_comp (int) – Number of NMF components

  • n_freq (int) – The number of frequency bins in the NMF dictionary

  • device (str) – The device to run the model

Example

>>> NMF_dec = NMFDecoderAudio(20, 210, device='cpu')
>>> H = torch.rand(1, 20, 150)
>>> Xhat = NMF_dec.forward(H)
>>> print(Xhat.shape)
torch.Size([1, 210, 150])

forward(H)[source]

The forward pass for NMF given the activations H

Parameters:

H (torch.Tensor) – The activations Tensor with shape B x n_comp x T, where B = batch size, n_comp = number of NMF components, and T = number of timepoints.

Returns:

output – The NMF outputs

Return type:

torch.Tensor

return_W()[source]

This function returns the NMF dictionary
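
For intuition, NMF decoding amounts to multiplying a nonnegative dictionary W (n_freq x n_comp) with the activations H (B x n_comp x T). The sketch below reproduces the shapes from the example with a random dictionary; the variable names and the ReLU used to keep W nonnegative are assumptions, not a description of the class internals.

```python
import torch

n_comp, n_freq, T = 20, 210, 150

# Hypothetical dictionary W (n_freq x n_comp); a real decoder would learn it.
W = torch.rand(n_freq, n_comp)
# Activations H with shape B x n_comp x T, matching the example above.
H = torch.rand(1, n_comp, T)

# Each output frame is a nonnegative mix of the dictionary columns.
# Assumption: a ReLU is applied to W to enforce nonnegativity.
Xhat = torch.einsum("fk,bkt->bft", torch.relu(W), H)
print(Xhat.shape)
```

The result has shape B x n_freq x T, the same layout as the output of the documented `forward(H)`.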

speechbrain.lobes.models.L2I.weights_init(m)[source]

Applies Xavier initialization to network weights.

Parameters:

m (nn.Module) – Module to initialize.
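
An initializer with this signature is usually passed to `nn.Module.apply`, which calls it on every submodule. The following is a minimal sketch of that pattern and not the library's own function: the name `xavier_init_sketch`, the set of layer types it touches, and the zero bias are assumptions.

```python
import torch
from torch import nn


def xavier_init_sketch(m: nn.Module) -> None:
    """Hypothetical stand-in for weights_init; not the library's code."""
    # Only layers that carry a weight matrix or kernel are touched.
    if isinstance(m, (nn.Linear, nn.Conv1d, nn.Conv2d)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)  # assumption: biases start at zero


net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
net.apply(xavier_init_sketch)  # apply() visits every submodule recursively

# Xavier-uniform samples from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
bound = (6.0 / (8 + 16)) ** 0.5
max_abs = net[0].weight.abs().max().item()
```

After the call, every weight in the first layer lies within the Xavier bound, which here is 0.5.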

class speechbrain.lobes.models.L2I.PsiOptimized(dim=128, K=100, numclasses=50, use_adapter=False, adapter_reduce_dim=True)[source]

Bases: Module

Convolutional Layers to estimate NMF Activations from Classifier Representations, optimized for log-spectra.

Parameters:
  • dim (int) – Dimension of the hidden representations (input to the classifier).

  • K (int) – Number of NMF components (or equivalently number of neurons at the output per timestep)

  • numclasses (int) – Number of possible classes.

  • use_adapter (bool) – True if you wish to learn an adapter for the latent representations.

  • adapter_reduce_dim (bool) – True if the adapter should compress the latent representations.

Example

>>> inp = torch.randn(1, 256, 26, 32)
>>> psi = PsiOptimized(dim=256, K=100, use_adapter=False, adapter_reduce_dim=False)
>>> h, inp_ad = psi(inp)
>>> print(h.shape, inp_ad.shape)
torch.Size([1, 1, 417, 100]) torch.Size([1, 256, 26, 32])

forward(hs)[source]

Computes forward step.

Parameters:

hs (torch.Tensor) – Latent representations (input to the classifier). Expected shape torch.Size([B, C, H, W]).

Returns:

NMF activations, with shape `torch.Size([B, 1, T, 100])` in the example above, and the adapted representations.

Return type:

tuple of torch.Tensor

class speechbrain.lobes.models.L2I.Theta(n_comp=100, T=431, num_classes=50)[source]

Bases: Module

This class implements a linear classifier on top of NMF activations

Parameters:
  • n_comp (int) – Number of NMF components

  • T (int) – Number of Timepoints in the NMF activations

  • num_classes (int) – Number of classes that the classifier works with

Example

>>> theta = Theta(30, 120, 50)
>>> H = torch.rand(1, 30, 120)
>>> c_hat = theta.forward(H)
>>> print(c_hat.shape)
torch.Size([1, 50])

forward(H)[source]

We first collapse the time axis, and then pass through the linear layer

Parameters:

H (torch.Tensor) – The activations Tensor with shape B x n_comp x T, where B = batch size, n_comp = number of NMF components, and T = number of timepoints.

Returns:

theta_out – Classifier output

Return type:

torch.Tensor
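
The two steps named above (collapse the time axis, then apply a linear layer) can be sketched as follows. How the time axis is collapsed is not stated here, so the learned T -> 1 projection is an assumption, as is the class name `ThetaSketch`; a mean over time would satisfy the same shape contract.

```python
import torch
from torch import nn


class ThetaSketch(nn.Module):
    """Hypothetical linear classifier on NMF activations (not the library's code)."""

    def __init__(self, n_comp: int, T: int, num_classes: int):
        super().__init__()
        # Assumption: the time axis is collapsed with a learned T -> 1 projection.
        self.collapse_time = nn.Linear(T, 1)
        self.classifier = nn.Linear(n_comp, num_classes)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        pooled = self.collapse_time(H).squeeze(-1)  # B x n_comp
        return self.classifier(pooled)  # B x num_classes


theta_sketch = ThetaSketch(n_comp=30, T=120, num_classes=50)
c_hat_sketch = theta_sketch(torch.rand(1, 30, 120))
```

With the sizes from the example, the output has one score per class for each item in the batch.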

class speechbrain.lobes.models.L2I.NMFEncoder(n_freq, n_comp)[source]

Bases: Module

This class implements an NMF encoder with a convolutional network

Parameters:
  • n_freq (int) – The number of frequency bins in the NMF dictionary

  • n_comp (int) – Number of NMF components

Example

>>> nmfencoder = NMFEncoder(513, 100)
>>> X = torch.rand(1, 513, 240)
>>> Hhat = nmfencoder(X)
>>> print(Hhat.shape)
torch.Size([1, 100, 240])

forward(X)[source]

Parameters:

X (torch.Tensor) – The input spectrogram Tensor with shape B x n_freq x T, where B = batch size, n_freq = nfft for the input spectrogram, and T = number of timepoints.

Return type:

NMF encoded outputs.
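
A convolutional encoder of this kind treats the frequency bins as input channels and convolves along time, so the number of timepoints is preserved. The sketch below illustrates that idea only; the layer count, hidden width of 256, and kernel size are assumptions and do not describe the layers inside NMFEncoder.

```python
import torch
from torch import nn

n_freq, n_comp = 513, 100

# Hypothetical encoder: frequency bins act as channels, convolution runs over time.
encoder_sketch = nn.Sequential(
    nn.Conv1d(n_freq, 256, kernel_size=3, padding=1),  # padding keeps T unchanged
    nn.ReLU(),
    nn.Conv1d(256, n_comp, kernel_size=3, padding=1),
    nn.ReLU(),  # final ReLU keeps the estimated activations nonnegative
)

X = torch.rand(1, n_freq, 240)  # B x n_freq x T
Hhat_sketch = encoder_sketch(X)  # B x n_comp x T
```

The nonnegative output matters because NMF activations are constrained to be nonnegative.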

class speechbrain.lobes.models.L2I.CNN14PSI_stft(dim=128, K=100)[source]

Bases: Module

This class estimates a saliency map on the STFT domain, given classifier representations.

Parameters:
  • dim (int) – Dimensionality of the input representations.

  • K (int) – Defines the number of output channels in the saliency map.

Example

>>> from speechbrain.lobes.models.Cnn14 import Cnn14
>>> classifier_embedder = Cnn14(mel_bins=80, emb_dim=2048, return_reps=True)
>>> x = torch.randn(2, 201, 80)
>>> _, hs = classifier_embedder(x)
>>> psimodel = CNN14PSI_stft(2048, 20)
>>> xhat = psimodel.forward(hs)
>>> print(xhat.shape)
torch.Size([2, 20, 207])

forward(hs, labels=None)[source]

Forward step. Estimates NMF activations to be used to get the saliency mask.

Parameters:
  • hs (torch.Tensor) – Classifier’s representations.

  • labels (torch.Tensor) – Predicted labels for classifier’s representations.

Returns:

xhat – The estimated NMF activation coefficients

Return type:

torch.Tensor

class speechbrain.lobes.models.L2I.CNN14PSI_stft_2d(dim=128, K=100)[source]

Bases: Module

This class estimates the NMF activations to create a saliency map using the L2I framework

Parameters:
  • dim (int) – Dimensionality of the input representations.

  • K (int) – Defines the number of output channels in the saliency map.

Example

>>> from speechbrain.lobes.models.Cnn14 import Cnn14
>>> classifier_embedder = Cnn14(mel_bins=80, emb_dim=2048, return_reps=True)
>>> x = torch.randn(2, 201, 80)
>>> _, hs = classifier_embedder(x)
>>> psimodel = CNN14PSI_stft_2d(2048, 20)
>>> xhat = psimodel.forward(hs)
>>> print(xhat.shape)
torch.Size([2, 20, 207])

forward(hs, labels=None)[source]

Forward step. Estimates NMF activations to be used to get the saliency mask.

Parameters:
  • hs (torch.Tensor) – Classifier’s representations.

  • labels (torch.Tensor) – Predicted labels for classifier’s representations.

Returns:

xhat – The estimated NMF activation coefficients

Return type:

torch.Tensor