speechbrain.nnet.CNN module

Library implementing convolutional neural networks.

Authors
  • Mirco Ravanelli 2020

  • Jianyuan Zhong 2020

  • Cem Subakan 2021

  • Davide Borra 2021

Summary

Classes:

Conv1d

This class implements 1d convolution.

Conv2d

This class implements 2d convolution.

Conv2dWithConstraint

This class implements 2d convolution with a kernel max-norm constraint.

ConvTranspose1d

This class implements 1d transposed convolution with speechbrain.

DepthwiseSeparableConv1d

This class implements the depthwise separable 1d convolution.

DepthwiseSeparableConv2d

This class implements the depthwise separable 2d convolution.

SincConv

This class implements SincConv (SincNet).

Functions:

get_padding_elem

This function computes the number of elements to add for zero-padding.

get_padding_elem_transposed

This function computes the required padding size for transposed convolution.

Reference

class speechbrain.nnet.CNN.SincConv(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', padding_mode='reflect', sample_rate=16000, min_low_hz=50, min_band_hz=50)

Bases: torch.nn.modules.module.Module

This class implements SincConv (SincNet).

M. Ravanelli, Y. Bengio, “Speaker Recognition from raw waveform with SincNet”, in Proc. of SLT 2018 (https://arxiv.org/abs/1808.00158)

Parameters
  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

  • sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.

  • min_low_hz (float) – Lowest possible frequency (in Hz) for a filter. It is only used for sinc_conv.

  • min_band_hz (float) – Lowest possible value (in Hz) for a filter bandwidth.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import SincConv
>>> inp_tensor = torch.rand([10, 16000])
>>> conv = SincConv(input_shape=inp_tensor.shape, out_channels=25, kernel_size=11)
>>> out_tensor = conv(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 16000, 25])
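The filters of SincNet are band-pass filters described only by a low cutoff and a bandwidth. The plain-Python sketch below follows the formula from the paper cited above (the difference of two low-pass sinc filters, smoothed with a Hamming window). The helper name `bandpass_kernel`, the odd `kernel_size`, and the way `min_low_hz` and `min_band_hz` are applied are assumptions made for illustration; this is not the library's implementation.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x) / x, with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def bandpass_kernel(low_hz, band_hz, kernel_size=11, sample_rate=16000,
                    min_low_hz=50.0, min_band_hz=50.0):
    """Hypothetical helper: one windowed sinc band-pass filter (odd kernel_size)."""
    # Assumption: the learnable values act as offsets on top of the minima,
    # so f1 >= min_low_hz and, before clipping to Nyquist, f2 - f1 >= min_band_hz.
    f1 = min_low_hz + abs(low_hz)
    f2 = min(f1 + min_band_hz + abs(band_hz), sample_rate / 2)
    half = (kernel_size - 1) // 2
    kernel = []
    for n in range(-half, half + 1):
        t = n / sample_rate  # time of this tap in seconds
        # Band-pass = low-pass with cutoff f2 minus low-pass with cutoff f1.
        g = 2 * f2 * sinc(2 * math.pi * f2 * t) - 2 * f1 * sinc(2 * math.pi * f1 * t)
        # Hamming window to soften the truncation of the ideal filter.
        w = 0.54 - 0.46 * math.cos(2 * math.pi * (n + half) / (kernel_size - 1))
        kernel.append(g * w)
    return f1, f2, kernel


f1, f2, kernel = bandpass_kernel(low_hz=250.0, band_hz=1000.0)
print(f1, f2, len(kernel))  # 300.0 1350.0 11
```

Each filter therefore has two free parameters regardless of kernel_size, which is the main difference from a standard convolution whose taps are all learned.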
forward(x)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.Conv1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', groups=1, bias=True, padding_mode='reflect', skip_transpose=False)

Bases: torch.nn.modules.module.Module

This class implements 1d convolution.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • groups (int) – Number of blocked connections from input channels to output channels.

  • bias (bool) – If True, the additive bias b is adopted.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv1d
>>> inp_tensor = torch.rand([10, 40, 16])
>>> cnn_1d = Conv1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
... )
>>> out_tensor = cnn_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 8])
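To illustrate the “causal” padding option described above, the following plain-Python sketch pads the input only on the left with (kernel_size - 1) * dilation zeros, so that each output frame depends on the current and past samples only. `causal_conv1d` is a hypothetical helper written for this illustration, and zero padding is an assumption made for simplicity; it does not reproduce the library's code.

```python
def causal_conv1d(signal, kernel, dilation=1):
    """Hypothetical helper: causal 1d convolution (cross-correlation) on lists."""
    k = len(kernel)
    num_pad = (k - 1) * dilation
    padded = [0.0] * num_pad + list(signal)  # pad only on the left (the past)
    out = []
    for t in range(len(signal)):
        # The last kernel tap is aligned with the current sample t.
        acc = 0.0
        for j in range(k):
            acc += kernel[j] * padded[t + j * dilation]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, 0.25, 0.25]
out = causal_conv1d(signal, kernel)
print(out)  # [0.25, 0.75, 1.75, 2.75, 3.75]

# Changing a future sample must not change earlier outputs.
changed = signal[:3] + [100.0, 5.0]
out_changed = causal_conv1d(changed, kernel)
print(out[:3] == out_changed[:3])  # True
```

The output has the same length as the input, and the first outputs are computed partly from the padded zeros.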
forward(x)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.Conv2d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=(1, 1), dilation=(1, 1), padding='same', groups=1, bias=True, padding_mode='reflect')

Bases: torch.nn.modules.module.Module

This class implements 2d convolution.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the 2d convolutional filters over time and frequency axis.

  • dilation (int) – Dilation factor of the 2d convolutional filters over time and frequency axis.

  • padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, the output shape is the same as the input shape.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv2d
>>> inp_tensor = torch.rand([10, 40, 16, 8])
>>> cnn_2d = Conv2d(
...     input_shape=inp_tensor.shape, out_channels=5, kernel_size=(7, 3)
... )
>>> out_tensor = cnn_2d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 16, 5])
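As a rough guide to how kernel_size, stride and dilation interact along the time and frequency axes, the sketch below applies the standard convolution output-length formula to each axis for the unpadded (“valid”) case. `conv_out_len` and `conv2d_valid_out_shape` are hypothetical helpers doing plain arithmetic; they assume a (batch, time, freq, channel) input and are not calls into the library.

```python
def conv_out_len(l_in, kernel_size, stride=1, dilation=1):
    """Output length of an unpadded convolution along one axis."""
    return (l_in - dilation * (kernel_size - 1) - 1) // stride + 1


def conv2d_valid_out_shape(input_shape, out_channels, kernel_size,
                           stride=(1, 1), dilation=(1, 1)):
    """Hypothetical helper: output shape for an unpadded 2d convolution."""
    batch, time, freq, _ = input_shape
    t_out = conv_out_len(time, kernel_size[0], stride[0], dilation[0])
    f_out = conv_out_len(freq, kernel_size[1], stride[1], dilation[1])
    return (batch, t_out, f_out, out_channels)


shape = (10, 40, 16, 8)
plain = conv2d_valid_out_shape(shape, out_channels=5, kernel_size=(7, 3))
strided = conv2d_valid_out_shape(shape, 5, (7, 3), stride=(2, 1))
dilated = conv2d_valid_out_shape(shape, 5, (7, 3), dilation=(2, 1))
print(plain)    # (10, 34, 14, 5)
print(strided)  # (10, 17, 14, 5)
print(dilated)  # (10, 28, 14, 5)
```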
forward(x)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, freq, channel)) – input to convolve. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.CNN.Conv2dWithConstraint(*args, max_norm=1, **kwargs)

Bases: speechbrain.nnet.CNN.Conv2d

This class implements 2d convolution with a kernel max-norm constraint, which sets an upper bound on the kernel norm.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the 2d convolutional filters over time and frequency axis.

  • dilation (int) – Dilation factor of the 2d convolutional filters over time and frequency axis.

  • padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, the output shape is the same as the input shape.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

  • max_norm (float) – Maximum norm allowed for the kernel.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv2dWithConstraint
>>> inp_tensor = torch.rand([10, 40, 16, 8])
>>> max_norm = 1
>>> cnn_2d_constrained = Conv2dWithConstraint(
...     in_channels=inp_tensor.shape[-1], out_channels=5, kernel_size=(7, 3), max_norm=max_norm
... )
>>> out_tensor = cnn_2d_constrained(inp_tensor)
>>> torch.any(torch.norm(cnn_2d_constrained.conv.weight.data, p=2, dim=0)>max_norm)
tensor(False)
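A max-norm constraint can be pictured as rescaling any kernel whose L2 norm exceeds max_norm back onto the norm ball, while leaving smaller kernels untouched. The plain-Python sketch below shows that idea on flat lists of weights. `clip_to_max_norm` is a hypothetical helper; how and when the library applies the constraint (for example, along which weight dimension) is not covered here.

```python
import math


def l2_norm(kernel):
    return math.sqrt(sum(w * w for w in kernel))


def clip_to_max_norm(kernel, max_norm=1.0, eps=1e-7):
    """Hypothetical helper: rescale a kernel so that its L2 norm is <= max_norm."""
    norm = l2_norm(kernel)
    if norm <= max_norm:
        return list(kernel)  # already inside the bound, keep as is
    scale = max_norm / (norm + eps)  # eps keeps the result strictly inside
    return [w * scale for w in kernel]


big = [3.0, 4.0]    # norm 5.0, violates the bound
small = [0.3, 0.4]  # norm 0.5, satisfies the bound
clipped = clip_to_max_norm(big)
print(l2_norm(clipped) <= 1.0)           # True
print(clip_to_max_norm(small) == small)  # True
```

Only the scale of the kernel changes; its direction (the ratio between weights) is preserved.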
forward(x)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, freq, channel)) – input to convolve. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.CNN.ConvTranspose1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding=0, output_padding=0, groups=1, bias=True, skip_transpose=False)

Bases: torch.nn.modules.module.Module

This class implements 1d transposed convolution with speechbrain. Transposed convolution is normally used to perform upsampling.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, upsampling in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str or int) – To obtain the target output dimension, we suggest tuning the kernel size and the padding. The following options give some control over the padding and the corresponding output dimensionality. If “valid”, no padding is applied. If “same”, the padding amount is inferred so that the output size is as close as possible to the input size. If “factor”, the padding amount is inferred so that the output size is as close as possible to input size * stride. For some kernel_size / stride combinations the exact size cannot be obtained with “same” or “factor”, in which case the closest possible size is returned. If an integer value is given, it is used as a custom padding.

  • output_padding (int) – Additional size added to one side of the output shape.

  • groups (int) – Number of blocked connections from input channels to output channels. Default: 1

  • bias (bool) – If True, adds a learnable bias to the output.

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import Conv1d, ConvTranspose1d
>>> inp_tensor = torch.rand([10, 12, 40]) #[batch, time, fea]
>>> convtranspose_1d = ConvTranspose1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=3, stride=2
... )
>>> out_tensor = convtranspose_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 25, 8])
>>> # Combination of Conv1d and ConvTranspose1d
>>> signal = torch.rand([1,100]) #[batch, time]
>>> conv1d = Conv1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2)
>>> conv_out = conv1d(signal)
>>> conv_t = ConvTranspose1d(input_shape=conv_out.shape, out_channels=1, kernel_size=3, stride=2, padding=1)
>>> signal_rec = conv_t(conv_out, output_size=[100])
>>> signal_rec.shape
torch.Size([1, 100])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2, padding='same')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 115])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=7, stride=2, padding='valid')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 235])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=7, stride=2, padding='factor')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 231])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2, padding=10)
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 211])
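The output lengths shown above for zero or integer padding are consistent with the usual transposed-convolution length formula, L_out = (L_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1. The sketch below checks this with a hypothetical helper, `conv_transpose_out_len`; it is plain arithmetic and does not call the library.

```python
def conv_transpose_out_len(l_in, kernel_size, stride=1, padding=0,
                           dilation=1, output_padding=0):
    """Hypothetical helper: output length of a 1d transposed convolution."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# 12 input frames, kernel_size=3, stride=2, no padding
first = conv_transpose_out_len(12, kernel_size=3, stride=2)
# 115 input samples, kernel_size=7, stride=2, no padding ("valid")
valid = conv_transpose_out_len(115, kernel_size=7, stride=2)
# 115 input samples, kernel_size=3, stride=2, padding=10
custom = conv_transpose_out_len(115, kernel_size=3, stride=2, padding=10)
print(first, valid, custom)  # 25 235 211
```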
forward(x, output_size=None)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.DepthwiseSeparableConv1d(out_channels, kernel_size, input_shape, stride=1, dilation=1, padding='same', bias=True)

Bases: torch.nn.modules.module.Module

This class implements the depthwise separable 1d convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the result to the output channels.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – Expected shape of the input.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import DepthwiseSeparableConv1d
>>> inp = torch.randn([8, 120, 40])
>>> conv = DepthwiseSeparableConv1d(256, 3, input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 256])
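The two stages described above can be sketched on nested lists: a depthwise step that filters every channel with its own kernel, followed by a pointwise step (kernel size 1) that mixes channels. `depthwise_conv1d` and `pointwise_conv1d` are hypothetical helpers without padding or bias, written only to make the data flow explicit; they are not the library's implementation.

```python
def depthwise_conv1d(x, kernels):
    """x: [time][channel]; kernels: one 1d kernel per channel (no padding)."""
    k = len(kernels[0])
    n_ch = len(kernels)
    out = []
    for t in range(len(x) - k + 1):
        # Each channel c is filtered independently with kernels[c].
        out.append([
            sum(kernels[c][j] * x[t + j][c] for j in range(k))
            for c in range(n_ch)
        ])
    return out


def pointwise_conv1d(x, weights):
    """x: [time][in_channel]; weights: [out_channel][in_channel]."""
    # A kernel of size 1: every frame is projected on its own.
    return [
        [sum(w_o[c] * frame[c] for c in range(len(frame))) for w_o in weights]
        for frame in x
    ]


x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]  # 4 frames, 2 channels
dw_kernels = [[1.0, 1.0], [0.5, 0.5]]               # one kernel per input channel
pw_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # project 2 -> 3 channels

dw = depthwise_conv1d(x, dw_kernels)
out = pointwise_conv1d(dw, pw_weights)
print(out)  # [[3.0, 15.0, 18.0], [5.0, 25.0, 30.0], [7.0, 35.0, 42.0]]
```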
forward(x)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.DepthwiseSeparableConv2d(out_channels, kernel_size, input_shape, stride=(1, 1), dilation=(1, 1), padding='same', bias=True)

Bases: torch.nn.modules.module.Module

This class implements the depthwise separable 2d convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the result to the output channels.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – Expected shape of the input.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

Example

>>> import torch
>>> from speechbrain.nnet.CNN import DepthwiseSeparableConv2d
>>> inp = torch.randn([8, 120, 40, 1])
>>> conv = DepthwiseSeparableConv2d(256, (3, 3), input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 40, 256])
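One common motivation for depthwise separable convolutions is the reduced parameter count. The sketch below compares the number of weights of a standard 2d convolution with a depthwise-plus-pointwise pair. Both helpers are hypothetical, and the channel counts are illustrative values; the sketch assumes one depthwise filter per input channel and a bias in both stages, which may differ from what the library does.

```python
def standard_conv2d_params(in_ch, out_ch, kernel_size, bias=True):
    """Weights of a standard 2d convolution."""
    k_h, k_w = kernel_size
    return in_ch * out_ch * k_h * k_w + (out_ch if bias else 0)


def separable_conv2d_params(in_ch, out_ch, kernel_size, bias=True):
    """Weights of a depthwise conv followed by a 1x1 pointwise conv."""
    k_h, k_w = kernel_size
    # Assumption: one depthwise filter per input channel.
    depthwise = in_ch * k_h * k_w + (in_ch if bias else 0)
    pointwise = in_ch * out_ch + (out_ch if bias else 0)
    return depthwise + pointwise


n_standard = standard_conv2d_params(64, 256, (3, 3), bias=False)
n_separable = separable_conv2d_params(64, 256, (3, 3), bias=False)
print(n_standard, n_separable)  # 147456 16960
```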
forward(x)

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, freq, channel)) – input to convolve. 3d or 4d tensors are expected.

training: bool
speechbrain.nnet.CNN.get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int)

This function computes the number of elements to add for zero-padding.

Parameters
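  • L_in (int) – Length of the input along the axis to pad.

  • stride (int) – Stride factor of the convolution.

  • kernel_size (int) – Kernel size of the convolution.

  • dilation (int) – Dilation factor of the convolution.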
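For stride 1, keeping the output as long as the input requires dilation * (kernel_size - 1) extra samples in total, split across both ends. A minimal sketch of that arithmetic follows. `same_padding_stride1` is a hypothetical helper that covers only the stride-1 case with an even total padding (for example, an odd kernel size); it is not claimed to match get_padding_elem in other cases.

```python
def same_padding_stride1(l_in, kernel_size, dilation=1):
    """Hypothetical helper: [left, right] padding for a stride-1 'same' convolution."""
    # Length an unpadded stride-1 convolution would produce.
    l_out = l_in - dilation * (kernel_size - 1)
    per_side = (l_in - l_out) // 2
    return [per_side, per_side]


pad = same_padding_stride1(40, kernel_size=5)
print(pad)  # [2, 2]

# After padding, the unpadded-convolution length equals the input length.
padded_len = 40 + sum(pad)
print(padded_len - (5 - 1))  # 40
```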

speechbrain.nnet.CNN.get_padding_elem_transposed(L_out: int, L_in: int, stride: int, kernel_size: int, dilation: int, output_padding: int)

This function computes the required padding size for transposed convolution.

Parameters
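  • L_out (int) – Desired length of the output.

  • L_in (int) – Length of the input.

  • stride (int) – Stride factor of the transposed convolution.

  • kernel_size (int) – Kernel size of the transposed convolution.

  • dilation (int) – Dilation factor of the transposed convolution.

  • output_padding (int) – Additional size added to one side of the output shape.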
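Solving the transposed-convolution length formula for the padding gives the amount needed to approach a target output length. The sketch below does this in plain Python and reproduces the output sizes shown in the ConvTranspose1d examples for padding='same' and padding='factor'. `transposed_padding` and `transposed_out_len` are hypothetical helpers, and rounding the padding down is an assumption.

```python
def transposed_out_len(l_in, stride, kernel_size, padding=0,
                       dilation=1, output_padding=0):
    """Output length of a 1d transposed convolution."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


def transposed_padding(l_out, l_in, stride, kernel_size,
                       dilation=1, output_padding=0):
    """Hypothetical helper: padding that brings the output closest to l_out."""
    unpadded = transposed_out_len(l_in, stride, kernel_size,
                                  dilation=dilation,
                                  output_padding=output_padding)
    # Each unit of padding trims two samples; round down (assumption).
    return max((unpadded - l_out) // 2, 0)


# Target: same length as the input (115), kernel_size=3, stride=2
p_same = transposed_padding(115, 115, stride=2, kernel_size=3)
# Target: input length * stride (230), kernel_size=7, stride=2
p_factor = transposed_padding(230, 115, stride=2, kernel_size=7)
print(p_same, transposed_out_len(115, 2, 3, padding=p_same))      # 58 115
print(p_factor, transposed_out_len(115, 2, 7, padding=p_factor))  # 2 231
```

In the second case the exact target of 230 cannot be reached with an integer padding, so the closest achievable length is obtained instead.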