speechbrain.lobes.models.Cnn14 module

This file implements the CNN14 model from https://arxiv.org/abs/1912.10211

Authors

  • Cem Subakan 2022

  • Francesco Paissan 2022

Summary

Classes:

CNN14PSI

This class estimates a mel-domain saliency mask

CNN14PSI_stft

This class estimates a saliency map on the STFT domain, given classifier representations.

Cnn14

This class implements the Cnn14 model from https://arxiv.org/abs/1912.10211

ConvBlock

This class implements the convolutional block used in CNN14

Functions:

init_bn

Initialize a Batchnorm layer.

init_layer

Initialize a Linear or Convolutional layer.

Reference

speechbrain.lobes.models.Cnn14.init_layer(layer)

Initialize a Linear or Convolutional layer.

speechbrain.lobes.models.Cnn14.init_bn(bn)

Initialize a Batchnorm layer.
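The reference above does not say which initialization scheme these helpers apply. The following is only a minimal sketch of what such initializers commonly do. It assumes the convention used in the PANNs reference code (Xavier-uniform weights and zero bias for linear/convolutional layers, unit weight and zero bias for batch norm). The names init_layer_sketch and init_bn_sketch are hypothetical stand-ins, not the library functions.

```python
import torch
import torch.nn as nn


def init_layer_sketch(layer):
    """Hypothetical stand-in: Xavier-uniform weights, zero bias (assumed)."""
    nn.init.xavier_uniform_(layer.weight)
    if layer.bias is not None:
        nn.init.zeros_(layer.bias)


def init_bn_sketch(bn):
    """Hypothetical stand-in: start batch norm as an identity affine map (assumed)."""
    nn.init.ones_(bn.weight)
    nn.init.zeros_(bn.bias)


conv = nn.Conv2d(10, 20, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(20)
init_layer_sketch(conv)
init_bn_sketch(bn)
```

Starting batch norm with unit weight and zero bias means the layer initially only normalizes its input, which is a common default for convolutional encoders.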

class speechbrain.lobes.models.Cnn14.ConvBlock(in_channels, out_channels, norm_type)

Bases: Module

This class implements the convolutional block used in CNN14

Parameters:
  • in_channels (int) – Number of input channels

  • out_channels (int) – Number of output channels

  • norm_type (str in ['bn', 'in', 'ln']) – The type of normalization

Example

>>> convblock = ConvBlock(10, 20, 'ln')
>>> x = torch.rand(5, 10, 20, 30)
>>> y = convblock(x)
>>> print(y.shape)
torch.Size([5, 20, 10, 15])

init_weight()

Initializes the model convolutional layers and the batchnorm layers

forward(x, pool_size=(2, 2), pool_type='avg')

The forward pass for convblocks in CNN14

Parameters:
  • x (torch.Tensor) – Input tensor with shape B x C_in x D1 x D2, where:

    B = batch size

    C_in = number of input channels

    D1 = size of the first spatial dimension

    D2 = size of the second spatial dimension

  • pool_size (tuple with integer values) – Amount of pooling at each layer

  • pool_type (str in ['max', 'avg', 'avg+max']) – The type of pooling

Returns:

The output of one conv block.
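To illustrate the pool_size and pool_type parameters, here is a minimal sketch of the pooling step alone, written with plain PyTorch functions. The helper pool_sketch is hypothetical. Summing the two pooled maps for 'avg+max' is an assumption borrowed from the PANNs reference code and is not confirmed by this page.

```python
import torch
import torch.nn.functional as F


def pool_sketch(x, pool_size=(2, 2), pool_type="avg"):
    """Hypothetical helper mirroring the three documented pool_type values."""
    if pool_type == "max":
        return F.max_pool2d(x, kernel_size=pool_size)
    if pool_type == "avg":
        return F.avg_pool2d(x, kernel_size=pool_size)
    if pool_type == "avg+max":
        # Assumption: the average- and max-pooled maps are summed.
        avg = F.avg_pool2d(x, kernel_size=pool_size)
        mx = F.max_pool2d(x, kernel_size=pool_size)
        return avg + mx
    raise ValueError(f"Unknown pool_type: {pool_type!r}")


# B x C x D1 x D2 feature map, e.g. after the block's convolutions
x = torch.rand(5, 20, 20, 30)
shapes = {
    t: tuple(pool_sketch(x, (2, 2), t).shape)
    for t in ("max", "avg", "avg+max")
}
print(shapes)
```

With pool_size=(2, 2), each mode halves both spatial dimensions, which matches the (5, 20, 10, 15) output shape in the ConvBlock example.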

class speechbrain.lobes.models.Cnn14.Cnn14(mel_bins, emb_dim, norm_type='bn', return_reps=False, l2i=False)

Bases: Module

This class implements the Cnn14 model from https://arxiv.org/abs/1912.10211

Parameters:
  • mel_bins (int) – Number of mel frequency bins in the input

  • emb_dim (int) – The dimensionality of the output embeddings

  • norm_type (str in ['bn', 'in', 'ln']) – The type of normalization

  • return_reps (bool (default=False)) – If True, the model also returns intermediate representations for interpretation

  • l2i (bool) – If True, remove one of the outputs.

Example

>>> cnn14 = Cnn14(120, 256)
>>> x = torch.rand(3, 400, 120)
>>> h = cnn14.forward(x)
>>> print(h.shape)
torch.Size([3, 1, 256])

init_weight()

Initializes the model batch norm layer

forward(x)

The forward pass for the CNN14 encoder

Parameters:

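x (torch.Tensor) – Input tensor with shape B x C_in x D1 x D2, where:

B = batch size

C_in = number of input channels

D1 = size of the first spatial dimension

D2 = size of the second spatial dimension

Note that the class example above passes a 3-D tensor of shape (3, 400, 120), whose last dimension matches mel_bins.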

Returns:

Outputs of the CNN14 encoder. If return_reps is True, the intermediate representations are returned as a second output.
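The class example maps an input of shape (3, 400, 120) to an embedding of shape (3, 1, 256). As a rough sketch of how the time and mel axes shrink before they are pooled away, the helper below assumes the layout described for CNN14 in the PANNs paper: five conv blocks with 2x2 pooling, size-preserving convolutions, and a final block without pooling. The function name and its parameters are hypothetical and are not part of the SpeechBrain API.

```python
def cnn14_feature_map_size(n_frames, mel_bins, n_pooled_blocks=5, pool=2):
    """Hypothetical helper: time/mel sizes just before global pooling.

    Assumes size-preserving convolutions and floor division at each
    pooling step (the default behaviour of PyTorch pooling layers).
    """
    for _ in range(n_pooled_blocks):
        n_frames //= pool
        mel_bins //= pool
    return n_frames, mel_bins


print(cnn14_feature_map_size(400, 120))  # (12, 3)
```

Under these assumptions, inputs with fewer than 32 frames or 32 mel bins would collapse to a zero-sized axis, so it is worth checking input lengths before batching very short clips.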

class speechbrain.lobes.models.Cnn14.CNN14PSI(dim=128)

Bases: Module

This class estimates a mel-domain saliency mask

Parameters:

dim (int) – Dimensionality of the embeddings

Returns:

Estimated saliency map (before sigmoid)

Example

>>> from speechbrain.lobes.models.Cnn14 import Cnn14
>>> classifier_embedder = Cnn14(mel_bins=80, emb_dim=2048, return_reps=True)
>>> x = torch.randn(2, 201, 80)
>>> _, hs = classifier_embedder(x)
>>> psimodel = CNN14PSI(2048)
>>> xhat = psimodel.forward(hs)
>>> print(xhat.shape)
torch.Size([2, 1, 201, 80])

forward(hs, labels=None)

Forward step. Given the classifier's representations, estimates a saliency map.

Parameters:
  • hs (torch.Tensor) – Classifier’s representations.

  • labels (None) – Unused

Returns:

xhat – Estimated saliency map (before sigmoid)

Return type:

torch.Tensor
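Because the returned map is taken before the sigmoid, a caller has to apply the sigmoid to obtain a mask with values between 0 and 1. The sketch below shows one common way to use such a mask, namely elementwise multiplication with the input features. It uses random stand-in tensors with the shapes from the class example rather than real model outputs, and the masking step itself is an assumption about downstream use, not something this page specifies.

```python
import torch

# Stand-ins with the shapes from the class example (not real model outputs)
x = torch.rand(2, 201, 80)         # mel features, B x T x mel_bins
xhat = torch.randn(2, 1, 201, 80)  # pre-sigmoid saliency map, B x 1 x T x mel_bins

# Squash to [0, 1] and drop the singleton channel axis
mask = torch.sigmoid(xhat).squeeze(1)

# Assumed downstream use: keep only the part of the input marked as salient
x_salient = mask * x
print(x_salient.shape)
```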

class speechbrain.lobes.models.Cnn14.CNN14PSI_stft(dim=128, outdim=1)

Bases: Module

This class estimates a saliency map on the STFT domain, given classifier representations.

Parameters:
  • dim (int) – Dimensionality of the input representations.

  • outdim (int) – Defines the number of output channels in the saliency map.

Example

>>> from speechbrain.lobes.models.Cnn14 import Cnn14
>>> classifier_embedder = Cnn14(mel_bins=80, emb_dim=2048, return_reps=True)
>>> x = torch.randn(2, 201, 80)
>>> _, hs = classifier_embedder(x)
>>> psimodel = CNN14PSI_stft(2048, 1)
>>> xhat = psimodel.forward(hs)
>>> print(xhat.shape)
torch.Size([2, 1, 201, 513])

forward(hs)

Forward step to estimate the saliency map

Parameters:

hs (torch.Tensor) – Classifier’s representations.

Returns:

xhat – An estimate of the saliency map

Return type:

torch.Tensor