speechbrain.nnet.CNN module

Library implementing convolutional neural networks.

Authors
  • Mirco Ravanelli 2020

  • Jianyuan Zhong 2020

  • Cem Subakan 2021

  • Davide Borra 2021

  • Andreas Nautsch 2022

  • Sarthak Yadav 2022

Summary

Classes:

Conv1d

This class implements 1d convolution.

Conv2d

This class implements 2d convolution.

Conv2dWithConstraint

This class implements 2d convolution with a kernel max-norm constraint.

ConvTranspose1d

This class implements 1d transposed convolution with speechbrain.

DepthwiseSeparableConv1d

This class implements the depthwise separable 1d convolution.

DepthwiseSeparableConv2d

This class implements the depthwise separable 2d convolution.

GaborConv1d

This class implements 1D Gabor convolutions from the LEAF paper (Zeghidour et al., ICLR 2021).

SincConv

This class implements SincConv (SincNet).

Functions:

get_padding_elem

This function computes the number of elements to add for zero-padding.

get_padding_elem_transposed

This function computes the required padding size for transposed convolution.

Reference

class speechbrain.nnet.CNN.SincConv(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', padding_mode='reflect', sample_rate=16000, min_low_hz=50, min_band_hz=50)[source]

Bases: Module

This class implements SincConv (SincNet).

M. Ravanelli, Y. Bengio, “Speaker Recognition from raw waveform with SincNet”, in Proc. of SLT 2018 (https://arxiv.org/abs/1808.00158)

Parameters
  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

  • sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.

  • min_low_hz (float) – Lowest possible frequency (in Hz) for a filter. It is only used for sinc_conv.

  • min_band_hz (float) – Lowest possible value (in Hz) for a filter bandwidth.
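The role of min_low_hz and min_band_hz can be illustrated with a small pure-Python sketch of a SincNet-style band-pass filter, built as the difference of two sinc low-pass filters. This is not the SpeechBrain implementation: the helper name sinc_bandpass, its low_hz and band_hz arguments, the Hamming window, and the normalization are assumptions made for illustration.

```python
import math


def sinc_bandpass(low_hz, band_hz, kernel_size, sample_rate=16000,
                  min_low_hz=50.0, min_band_hz=50.0):
    """Band-pass FIR taps as the difference of two sinc low-pass filters.

    kernel_size is assumed to be odd so the filter has a center tap.
    """
    # Keep the cutoffs in a valid range: low >= min_low_hz and
    # (high - low) >= min_band_hz, with high capped at the Nyquist frequency.
    low = min_low_hz + abs(low_hz)
    high = min(low + min_band_hz + abs(band_hz), sample_rate / 2)

    def sinc(x):
        return 1.0 if x == 0 else math.sin(x) / x

    half = (kernel_size - 1) // 2
    taps = []
    for n in range(-half, half + 1):
        t = n / sample_rate  # tap position in seconds
        ideal = (2 * high * sinc(2 * math.pi * high * t)
                 - 2 * low * sinc(2 * math.pi * low * t))
        window = 0.54 + 0.46 * math.cos(math.pi * n / half)  # Hamming window
        taps.append(ideal * window / (2 * (high - low)))  # center tap becomes 1
    return taps


# Hypothetical raw parameters; the constraints map them to 80 Hz and 230 Hz.
taps = sinc_bandpass(low_hz=30.0, band_hz=100.0, kernel_size=11)
```

Only two values per filter (the low cutoff and the bandwidth) are free, which is the main idea behind SincNet: the filter shape is fixed and only the band edges are learned.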

Example

>>> import torch
>>> from speechbrain.nnet.CNN import SincConv
>>> inp_tensor = torch.rand([10, 16000])
>>> conv = SincConv(input_shape=inp_tensor.shape, out_channels=25, kernel_size=11)
>>> out_tensor = conv(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 16000, 25])
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.Conv1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', groups=1, bias=True, padding_mode='reflect', skip_transpose=False, weight_norm=False, conv_init=None)[source]

Bases: Module

This class implements 1d convolution.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • groups (int) – Number of blocked connections from input channels to output channels.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

  • weight_norm (bool) – If True, use weight normalization, to be removed with self.remove_weight_norm() at inference.
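The effect of padding, stride and dilation on the time axis can be checked with standard convolution arithmetic. The sketch below is a hypothetical helper (conv1d_out_len is not part of SpeechBrain) and assumes symmetric padding for "same" with an odd kernel and left-only padding for "causal"; the padding the library actually applies may differ in edge cases.

```python
def conv1d_out_len(l_in, kernel_size, stride=1, dilation=1, padding="same"):
    """Time-axis length after a 1d convolution (illustrative helper)."""
    span = dilation * (kernel_size - 1)  # kernel extent minus one sample
    if padding == "valid":
        pad_total = 0
    elif padding == "causal":
        pad_total = span  # assumption: all padding is placed on the past side
    elif padding == "same":
        pad_total = 2 * (span // 2)  # assumption: span // 2 samples per side
    else:
        raise ValueError(f"unknown padding mode: {padding}")
    return (l_in + pad_total - span - 1) // stride + 1


same_len = conv1d_out_len(40, kernel_size=5, padding="same")
valid_len = conv1d_out_len(40, kernel_size=5, padding="valid")
strided_len = conv1d_out_len(40, kernel_size=5, stride=2, padding="same")
```

With stride 1, "same" and "causal" preserve the 40 input frames, "valid" shortens the sequence by kernel_size - 1 frames, and a stride of 2 roughly halves it.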

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv1d
>>> inp_tensor = torch.rand([10, 40, 16])
>>> cnn_1d = Conv1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
... )
>>> out_tensor = cnn_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 8])
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

remove_weight_norm()[source]

Removes weight normalization at inference if used during training.
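As background, weight normalization reparameterizes each weight as a magnitude g times a unit-norm direction v / ||v||. The sketch below illustrates that idea in plain Python, assuming the layer relies on this usual reparameterization; weight_norm_compose is a hypothetical helper, not a SpeechBrain function.

```python
import math


def weight_norm_compose(g, v):
    """Return w = g * v / ||v|| for one filter given as a flat list."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


# During training, the direction v and the magnitude g are learned separately.
v = [3.0, 4.0]
g = 2.0

# Removing weight normalization amounts to storing the composed weight once,
# so inference no longer recomputes the norm on every forward pass.
w = weight_norm_compose(g, v)
w_norm = math.sqrt(sum(x * x for x in w))
```

The composed weight always has norm g, which is why the two parameters can be folded into a single tensor without changing the layer output.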

training: bool
class speechbrain.nnet.CNN.Conv2d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=(1, 1), dilation=(1, 1), padding='same', groups=1, bias=True, padding_mode='reflect', skip_transpose=False, weight_norm=False, conv_init=None)[source]

Bases: Module

This class implements 2d convolution.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the 2d convolutional filters over time and frequency axis.

  • dilation (int) – Dilation factor of the 2d convolutional filters over time and frequency axis.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is same as input shape. If “causal” then proper padding is inserted to simulate causal convolution on the first spatial dimension. (spatial dim 1 is dim 3 for both skip_transpose=False and skip_transpose=True)

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

  • skip_transpose (bool) – If False, uses batch x spatial.dim2 x spatial.dim1 x channel convention of speechbrain. If True, uses batch x channel x spatial.dim1 x spatial.dim2 convention.

  • weight_norm (bool) – If True, use weight normalization, to be removed with self.remove_weight_norm() at inference.
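The two layouts named for skip_transpose differ by a single swap of axis 1 and the last axis. The sketch below shows this on shape tuples only; it assumes this swap is how the conversion is done, and swap_axes_1_and_last is a hypothetical helper.

```python
def swap_axes_1_and_last(shape):
    """Swap axis 1 with the last axis of a shape tuple."""
    shape = list(shape)
    shape[1], shape[-1] = shape[-1], shape[1]
    return tuple(shape)


# Channels-last layout: batch x spatial.dim2 x spatial.dim1 x channel.
sb_shape = (10, 40, 16, 8)
# Channel-first layout: batch x channel x spatial.dim1 x spatial.dim2.
cf_shape = swap_axes_1_and_last(sb_shape)
# Applying the same swap again restores the original layout.
restored = swap_axes_1_and_last(cf_shape)
```

Because the swap is its own inverse, the same operation can be used before and after the underlying convolution.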

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv2d
>>> inp_tensor = torch.rand([10, 40, 16, 8])
>>> cnn_2d = Conv2d(
...     input_shape=inp_tensor.shape, out_channels=5, kernel_size=(7, 3)
... )
>>> out_tensor = cnn_2d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 16, 5])
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 4d tensors are expected.

remove_weight_norm()[source]

Removes weight normalization at inference if used during training.

training: bool
class speechbrain.nnet.CNN.Conv2dWithConstraint(*args, max_norm=1, **kwargs)[source]

Bases: Conv2d

This class implements 2d convolution with a kernel max-norm constraint. This corresponds to setting an upper bound on the kernel norm.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the 2d convolutional filters over time and frequency axis.

  • dilation (int) – Dilation factor of the 2d convolutional filters over time and frequency axis.

  • padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, output shape is same as input shape.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

  • max_norm (float) – kernel max-norm
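A max-norm constraint can be sketched as a rescaling step applied to every kernel whose L2 norm exceeds the bound. The pure-Python example below is only an illustration of that idea: the helper renorm, the eps value, and the grouping of weights into vectors are assumptions, not the library code.

```python
import math


def renorm(vectors, max_norm=1.0, eps=1e-7):
    """Scale down every vector whose L2 norm exceeds max_norm."""
    out = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > max_norm:
            scale = max_norm / (norm + eps)
            vec = [x * scale for x in vec]
        out.append(list(vec))
    return out


kernels = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
clipped = renorm(kernels, max_norm=1.0)
clipped_norms = [math.sqrt(sum(x * x for x in k)) for k in clipped]
```

Kernels already inside the bound are left untouched, so the constraint only acts on weights that have grown too large.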

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv2dWithConstraint
>>> inp_tensor = torch.rand([10, 40, 16, 8])
>>> max_norm = 1
>>> cnn_2d_constrained = Conv2dWithConstraint(
...     in_channels=inp_tensor.shape[-1], out_channels=5, kernel_size=(7, 3)
... )
>>> out_tensor = cnn_2d_constrained(inp_tensor)
>>> torch.any(torch.norm(cnn_2d_constrained.conv.weight.data, p=2, dim=0)>max_norm)
tensor(False)
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 4d tensors are expected.

training: bool
class speechbrain.nnet.CNN.ConvTranspose1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding=0, output_padding=0, groups=1, bias=True, skip_transpose=False, weight_norm=False)[source]

Bases: Module

This class implements 1d transposed convolution with speechbrain. Transpose convolution is normally used to perform upsampling.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, upsampling in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str or int) – To obtain the target output dimension, we suggest tuning the kernel size and the padding accordingly. The following options give some control over the padding and the corresponding output dimensionality. If “valid”, no padding is applied. If “same”, the padding amount is inferred so that the output size is as close as possible to the input size. If “factor”, the padding amount is inferred so that the output size is as close as possible to input size * stride. For some kernel_size / stride combinations the exact size cannot be obtained, in which case the closest possible size is returned. If an integer value is given, it is used as a custom padding.

  • output_padding (int) – Additional size added to one side of the output shape.

  • groups (int) – Number of blocked connections from input channels to output channels. Default: 1.

  • bias (bool) – If True, adds a learnable bias to the output.

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

  • weight_norm (bool) – If True, use weight normalization, to be removed with self.remove_weight_norm() at inference.
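For integer padding, the output length of a transposed convolution follows the usual formula documented for torch.nn.ConvTranspose1d. The helper below (conv_transpose1d_out_len, a hypothetical name) is a small sketch of that arithmetic, useful when tuning kernel_size, stride and padding for a target length.

```python
def conv_transpose1d_out_len(l_in, kernel_size, stride=1, padding=0,
                             dilation=1, output_padding=0):
    """Output length of a 1d transposed convolution (integer padding only)."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# 12 input frames, kernel_size=3, stride=2, no padding.
upsampled = conv_transpose1d_out_len(12, kernel_size=3, stride=2)
# 115 input frames, kernel_size=3, stride=2, padding=10.
padded = conv_transpose1d_out_len(115, kernel_size=3, stride=2, padding=10)
# 115 input frames, kernel_size=7, stride=2, no padding.
unpadded = conv_transpose1d_out_len(115, kernel_size=7, stride=2)
```

These three settings give 25, 211 and 235 frames, which agrees with the output shapes listed in the example for this class.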

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv1d, ConvTranspose1d
>>> inp_tensor = torch.rand([10, 12, 40]) #[batch, time, fea]
>>> convtranspose_1d = ConvTranspose1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=3, stride=2
... )
>>> out_tensor = convtranspose_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 25, 8])
>>> # Combination of Conv1d and ConvTranspose1d
>>> signal = torch.rand([1,100]) #[batch, time]
>>> conv1d = Conv1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2)
>>> conv_out = conv1d(signal)
>>> conv_t = ConvTranspose1d(input_shape=conv_out.shape, out_channels=1, kernel_size=3, stride=2, padding=1)
>>> signal_rec = conv_t(conv_out, output_size=[100])
>>> signal_rec.shape
torch.Size([1, 100])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2, padding='same')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 115])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=7, stride=2, padding='valid')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 235])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=7, stride=2, padding='factor')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 231])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2, padding=10)
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 211])
forward(x, output_size=None)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

remove_weight_norm()[source]

Removes weight normalization at inference if used during training.

training: bool
class speechbrain.nnet.CNN.DepthwiseSeparableConv1d(out_channels, kernel_size, input_shape, stride=1, dilation=1, padding='same', bias=True)[source]

Bases: Module

This class implements the depthwise separable 1d convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the result to the output channels.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – Expected shape of the input.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.
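The two steps of a depthwise separable convolution can be written out explicitly. The following naive pure-Python sketch uses "valid" padding, stride 1 and no bias; the function name and the nested-list data layout are chosen for illustration only and do not mirror the library internals.

```python
def depthwise_separable_conv1d(x, depthwise, pointwise):
    """x: [time][channel]; depthwise: [channel][k]; pointwise: [out][channel]."""
    n_time, n_ch = len(x), len(x[0])
    k = len(depthwise[0])
    out = []
    for t in range(n_time - k + 1):
        # Step 1: each channel is filtered by its own kernel (no channel mixing).
        per_channel = [
            sum(depthwise[c][j] * x[t + j][c] for j in range(k))
            for c in range(n_ch)
        ]
        # Step 2: a point-wise (kernel size 1) convolution mixes the channels.
        out.append([
            sum(w[c] * per_channel[c] for c in range(n_ch))
            for w in pointwise
        ])
    return out


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]  # 4 frames, 2 channels
depthwise = [[1.0, 1.0], [0.5, 0.5]]  # one length-2 kernel per input channel
pointwise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 output channels
y = depthwise_separable_conv1d(x, depthwise, pointwise)
```

The input has 4 frames and 2 channels, and the output has 3 frames (valid padding with a length-2 kernel) and 3 channels.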

Example

>>> import torch
>>> from speechbrain.nnet.CNN import DepthwiseSeparableConv1d
>>> inp = torch.randn([8, 120, 40])
>>> conv = DepthwiseSeparableConv1d(256, 3, input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 256])
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.DepthwiseSeparableConv2d(out_channels, kernel_size, input_shape, stride=(1, 1), dilation=(1, 1), padding='same', bias=True)[source]

Bases: Module

This class implements the depthwise separable 2d convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the result to the output channels.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – Expected shape of the input.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.
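A common motivation for the depthwise separable factorization is the reduction in the number of weights. The sketch below compares weight counts for a standard and a separable 2d convolution; it ignores biases, assumes one depthwise kernel per input channel, and uses hypothetical channel counts.

```python
def conv2d_weights(in_ch, out_ch, kernel_size):
    """Weight count of a standard 2d convolution (bias ignored)."""
    kh, kw = kernel_size
    return kh * kw * in_ch * out_ch


def separable_conv2d_weights(in_ch, out_ch, kernel_size):
    """Depthwise part (one kh x kw kernel per input channel) plus point-wise part."""
    kh, kw = kernel_size
    return kh * kw * in_ch + in_ch * out_ch


# Hypothetical layer: 64 input channels, 256 output channels, 3 x 3 kernel.
standard = conv2d_weights(64, 256, (3, 3))
separable = separable_conv2d_weights(64, 256, (3, 3))
```

For this setting the standard layer needs 147456 weights and the separable one 16960. The saving grows with the number of input channels; with a single input channel, as in the example for this class, there is little to gain.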

Example

>>> import torch
>>> from speechbrain.nnet.CNN import DepthwiseSeparableConv2d
>>> inp = torch.randn([8, 120, 40, 1])
>>> conv = DepthwiseSeparableConv2d(256, (3, 3), input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 40, 256])
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.GaborConv1d(out_channels, kernel_size, stride, input_shape=None, in_channels=None, padding='same', padding_mode='constant', sample_rate=16000, min_freq=60.0, max_freq=None, n_fft=512, normalize_energy=False, bias=False, sort_filters=False, use_legacy_complex=False, skip_transpose=False)[source]

Bases: Module

This class implements 1D Gabor convolutions from:

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • sample_rate (int) – Sampling rate of the input signals.

  • min_freq (float) – Lowest possible frequency (in Hz) for a filter.

  • max_freq (float) – Highest possible frequency (in Hz) for a filter.

  • n_fft (int) – Number of FFT bins for initialization.

  • normalize_energy (bool) – Whether to normalize energy at initialization. Default is False.

  • bias (bool) – If True, the additive bias b is adopted.

  • sort_filters (bool) – Whether to sort filters by center frequencies. Default is False.

  • use_legacy_complex (bool) – If False, the torch.complex64 data type is used for the Gabor impulse responses. If True, computation is performed on two real-valued tensors.

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.
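A Gabor filter is a Gaussian envelope multiplied by a complex sinusoid. The sketch below builds one impulse response in pure Python following the formulation in the LEAF paper. The helper name, the units (center frequency in radians per sample, sigma in samples) and the example values are assumptions, and the kernel size is assumed to be odd.

```python
import cmath
import math


def gabor_impulse_response(center_freq, sigma, kernel_size):
    """Complex Gabor filter: Gaussian envelope times a complex sinusoid."""
    half = kernel_size // 2
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    taps = []
    for t in range(-half, half + 1):
        envelope = math.exp(-(t * t) / (2.0 * sigma * sigma))
        carrier = cmath.exp(1j * center_freq * t)
        taps.append(norm * envelope * carrier)
    return taps


# 401 taps is roughly a 25 ms window at 16 kHz (0.025 * 16000 + 1).
taps = gabor_impulse_response(center_freq=0.3, sigma=40.0, kernel_size=401)

# Working on two real-valued sequences keeps these parts separately.
real_part = [z.real for z in taps]
imag_part = [z.imag for z in taps]
```

Each filter is described by only two values, a center frequency and a bandwidth (through sigma), instead of kernel_size free weights.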

Example

>>> import torch
>>> from speechbrain.nnet.CNN import GaborConv1d
>>> inp_tensor = torch.rand([10, 8000])
>>> # 401 corresponds to a window of 25 ms at 16 kHz
>>> gabor_conv = GaborConv1d(
...     40, kernel_size=401, stride=1, in_channels=1
... )
>>> out_tensor = gabor_conv(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 8000, 40])
forward(x)[source]

Returns the output of the Gabor convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve.

training: bool
speechbrain.nnet.CNN.get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int)[source]

This function computes the number of elements to add for zero-padding.

Parameters
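  • L_in (int) – Length of the input along the convolved dimension.

  • stride (int) – Stride factor of the convolution.

  • kernel_size (int) – Kernel size of the convolution.

  • dilation (int) – Dilation factor of the convolution.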

speechbrain.nnet.CNN.get_padding_elem_transposed(L_out: int, L_in: int, stride: int, kernel_size: int, dilation: int, output_padding: int)[source]

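This function computes the required padding size for transposed convolution.

Parameters
  • L_out (int) – Desired length of the output.

  • L_in (int) – Length of the input along the convolved dimension.

  • stride (int) – Stride factor of the convolution.

  • kernel_size (int) – Kernel size of the convolution.

  • dilation (int) – Dilation factor of the convolution.

  • output_padding (int) – Additional size added to one side of the output.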
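One way to obtain such a padding value is to solve the standard transposed-convolution length formula for the padding. The sketch below does this in plain Python; it is not the library code. The helper name transposed_padding, the rounding choice and the clamping to zero are assumptions.

```python
def transposed_padding(l_out, l_in, stride, kernel_size, dilation=1,
                       output_padding=0):
    """Padding that brings a transposed convolution closest to length l_out."""
    # Length without any padding, from
    # l_out = (l_in - 1) * stride - 2 * p + dilation * (k - 1) + output_padding + 1
    natural = ((l_in - 1) * stride + dilation * (kernel_size - 1)
               + output_padding + 1)
    # Assumption: round down when no exact solution exists (the output is then
    # one sample longer than requested) and never return a negative padding.
    return max((natural - l_out) // 2, 0)


# Target length 230 = 115 * 2, with kernel_size=7 and stride=2.
p = transposed_padding(l_out=230, l_in=115, stride=2, kernel_size=7)
resulting_len = (115 - 1) * 2 - 2 * p + (7 - 1) + 1
```

Here no integer padding reaches 230 exactly, so the sketch returns a padding of 2 and the resulting length is 231, the closest size that can be obtained.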