speechbrain.lobes.models.Cnn14 module
This file implements the CNN14 model from https://arxiv.org/abs/1912.10211
Authors
* Cem Subakan 2022
* Francesco Paissan 2022
Summary
Classes:
CNN14PSI | This class estimates a mel-domain saliency mask
CNN14PSI_stft | This class estimates a saliency map on the STFT domain, given classifier representations.
Cnn14 | This class implements the Cnn14 model from https://arxiv.org/abs/1912.10211
ConvBlock | This class implements the convolutional block used in CNN14
Functions:
init_bn | Initialize a Batchnorm layer.
init_layer | Initialize a Linear or Convolutional layer.
Reference
- speechbrain.lobes.models.Cnn14.init_layer(layer)
Initialize a Linear or Convolutional layer.
- class speechbrain.lobes.models.Cnn14.ConvBlock(in_channels, out_channels, norm_type)
Bases:
Module
This class implements the convolutional block used in CNN14
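- Parameters:
in_channels (int) – Number of input channels
out_channels (int) – Number of output channels
norm_type (str in ['bn', 'in', 'ln']) – The type of normalization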
Example
>>> import torch
>>> from speechbrain.lobes.models.Cnn14 import ConvBlock
>>> convblock = ConvBlock(10, 20, 'ln')
>>> x = torch.rand(5, 10, 20, 30)
>>> y = convblock(x)
>>> print(y.shape)
torch.Size([5, 20, 10, 15])
- forward(x, pool_size=(2, 2), pool_type='avg')
The forward pass for convblocks in CNN14
- Parameters:
x (torch.Tensor) – Input tensor with shape B x C_in x D1 x D2, where
B = batch size,
C_in = number of input channels,
D1 = dimensionality of the first spatial dimension,
D2 = dimensionality of the second spatial dimension
pool_size (tuple with integer values) – Amount of pooling at each layer
pool_type (str in ['max', 'avg', 'avg+max']) – The type of pooling
- Returns:
The output of one conv block
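The effect of pool_size and pool_type can be illustrated without any dependencies. The sketch below pools a single-channel grid held as nested lists. It assumes non-overlapping windows with floor division of each spatial dimension, and it assumes that 'avg+max' means the elementwise sum of the average-pooled and max-pooled outputs. Neither detail is stated in this reference. `pool2d` is a hypothetical helper, not part of SpeechBrain.

```python
def pool2d(grid, pool_size=(2, 2), pool_type="avg"):
    """Pool a 2-D grid (list of rows) with non-overlapping windows.

    Hypothetical helper for illustration only. 'avg+max' is assumed to be
    the sum of the average-pooled and max-pooled values.
    """
    ph, pw = pool_size
    out_rows, out_cols = len(grid) // ph, len(grid[0]) // pw
    pooled = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Collect the values that fall inside the current window.
            window = [
                grid[r * ph + i][c * pw + j]
                for i in range(ph)
                for j in range(pw)
            ]
            avg = sum(window) / len(window)
            if pool_type == "avg":
                row.append(avg)
            elif pool_type == "max":
                row.append(max(window))
            elif pool_type == "avg+max":
                row.append(avg + max(window))
            else:
                raise ValueError(f"Unknown pool_type: {pool_type}")
        pooled.append(row)
    return pooled


# A 2 x 4 grid pooled with (2, 2) windows becomes a 1 x 2 grid.
grid = [[1, 2, 3, 4], [5, 6, 7, 8]]
for mode in ("avg", "max", "avg+max"):
    print(mode, pool2d(grid, pool_type=mode))
```

With the default pool_size of (2, 2), each spatial dimension is halved, which matches the 20 x 30 to 10 x 15 change in the ConvBlock example.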
- class speechbrain.lobes.models.Cnn14.Cnn14(mel_bins, emb_dim, norm_type='bn', return_reps=False, l2i=False)
Bases:
Module
This class implements the Cnn14 model from https://arxiv.org/abs/1912.10211
- Parameters:
mel_bins (int) – Number of mel frequency bins in the input
emb_dim (int) – The dimensionality of the output embeddings
norm_type (str in ['bn', 'in', 'ln']) – The type of normalization
return_reps (bool (default=False)) – If True, the model also returns intermediate representations for interpretation
l2i (bool) – If True, remove one of the outputs.
Example
>>> import torch
>>> from speechbrain.lobes.models.Cnn14 import Cnn14
>>> cnn14 = Cnn14(120, 256)
>>> x = torch.rand(3, 400, 120)
>>> h = cnn14.forward(x)
>>> print(h.shape)
torch.Size([3, 1, 256])
- forward(x)
The forward pass for the CNN14 encoder
- Parameters:
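x (torch.Tensor) – Input tensor with shape B x C_in x D1 x D2, where
B = batch size,
C_in = number of input channels,
D1 = dimensionality of the first spatial dimension,
D2 = dimensionality of the second spatial dimension.
Note that the class example above passes a 3-D tensor of shape (3, 400, 120), whose last dimension equals mel_bins.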
- Returns:
Outputs of CNN14 encoder
- class speechbrain.lobes.models.Cnn14.CNN14PSI(dim=128)
Bases:
Module
This class estimates a mel-domain saliency mask
- Parameters:
dim (int) – Dimensionality of the embeddings
- Returns:
Estimated saliency map (before sigmoid)
Example
>>> import torch
>>> from speechbrain.lobes.models.Cnn14 import Cnn14, CNN14PSI
>>> classifier_embedder = Cnn14(mel_bins=80, emb_dim=2048, return_reps=True)
>>> x = torch.randn(2, 201, 80)
>>> _, hs = classifier_embedder(x)
>>> psimodel = CNN14PSI(2048)
>>> xhat = psimodel.forward(hs)
>>> print(xhat.shape)
torch.Size([2, 1, 201, 80])
- class speechbrain.lobes.models.Cnn14.CNN14PSI_stft(dim=128, outdim=1)
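Because the returned map is taken before the sigmoid, it holds unbounded scores rather than a mask. The sketch below shows one way to turn such scores into mask values in (0, 1) and apply them elementwise to a spectrogram. It uses nested lists to stay dependency-free; with tensors, `torch.sigmoid(xhat)` performs the same squashing. The helper names and toy values are illustrative assumptions, and this reference does not specify how the mask is used downstream.

```python
import math


def sigmoid(score):
    """Map an unbounded score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def apply_saliency_mask(scores, spectrogram):
    """Squash scores with a sigmoid and multiply them into the spectrogram.

    Both arguments are nested lists with the same time x frequency layout.
    Hypothetical helper for illustration only.
    """
    return [
        [sigmoid(s) * value for s, value in zip(score_row, spec_row)]
        for score_row, spec_row in zip(scores, spectrogram)
    ]


# Toy 2 x 2 example: a score of 0 gives a mask value of exactly 0.5,
# large positive scores keep the input, large negative scores suppress it.
scores = [[0.0, 10.0], [-10.0, 0.0]]
spectrogram = [[2.0, 2.0], [2.0, 2.0]]
masked = apply_saliency_mask(scores, spectrogram)
print(masked)
```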
Bases:
Module
This class estimates a saliency map on the STFT domain, given classifier representations.
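- Parameters:
dim (int) – Dimensionality of the input representations
outdim (int) – Number of output channels in the saliency map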
Example
>>> import torch
>>> from speechbrain.lobes.models.Cnn14 import Cnn14, CNN14PSI_stft
>>> classifier_embedder = Cnn14(mel_bins=80, emb_dim=2048, return_reps=True)
>>> x = torch.randn(2, 201, 80)
>>> _, hs = classifier_embedder(x)
>>> psimodel = CNN14PSI_stft(2048, 1)
>>> xhat = psimodel.forward(hs)
>>> print(xhat.shape)
torch.Size([2, 1, 201, 513])
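The last dimension of 513 in this example is consistent with a one-sided STFT computed with a 1024-point FFT, since a real-valued FFT of length n_fft yields n_fft // 2 + 1 frequency bins. The n_fft value is an inference from the output shape and is not stated in this reference. `onesided_stft_bins` is a hypothetical helper used only to show the arithmetic.

```python
def onesided_stft_bins(n_fft):
    """Number of frequency bins in a one-sided STFT with FFT size n_fft."""
    return n_fft // 2 + 1


# Assumed FFT size; 1024 // 2 + 1 = 513 matches the example's last dimension.
print(onesided_stft_bins(1024))
```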