speechbrain.nnet.CNN module

Library implementing convolutional neural networks.

Authors

Mirco Ravanelli 2020
Jianyuan Zhong 2020
Cem Subakan 2021
Davide Borra 2021
Andreas Nautsch 2022
Sarthak Yadav 2022

Summary

Classes:

`Conv1d`	This class implements 1d convolution.
`Conv2d`	This class implements 2d convolution.
`ConvTranspose1d`	This class implements 1d transposed convolution with speechbrain.
`DepthwiseSeparableConv1d`	This class implements the depthwise separable 1d convolution.
`DepthwiseSeparableConv2d`	This class implements the depthwise separable 2d convolution.
`GaborConv1d`	This class implements 1D Gabor convolutions from the LEAF paper (Zeghidour et al., ICLR 2021).
`SincConv`	This class implements SincConv (SincNet).

Functions:

`get_padding_elem`	This function computes the number of elements to add for zero-padding.
`get_padding_elem_transposed`	This function computes the required padding size for transposed convolution

Reference

class speechbrain.nnet.CNN.SincConv(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', padding_mode='reflect', sample_rate=16000, min_low_hz=50, min_band_hz=50)[source]

Bases: Module

This class implements SincConv (SincNet).

M. Ravanelli, Y. Bengio, “Speaker Recognition from raw waveform with SincNet”, in Proc. of SLT 2018 (https://arxiv.org/abs/1808.00158)

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (int) – Kernel size of the convolutional filters.
input_shape (tuple) – The shape of the input. Alternatively use in_channels.
in_channels (int) – The number of input channels. Alternatively use input_shape.
stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.
dilation (int) – Dilation factor of the convolutional filters.
padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.
padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.
sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.
min_low_hz (float) – Lowest possible frequency (in Hz) for a filter. It is only used for sinc_conv.
min_band_hz (float) – Lowest possible value (in Hz) for a filter bandwidth.

Example

>>> inp_tensor = torch.rand([10, 16000])
>>> conv = SincConv(
...     input_shape=inp_tensor.shape, out_channels=25, kernel_size=11
... )
>>> out_tensor = conv(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 16000, 25])
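
As a rough illustration of what a single SincNet-style filter looks like, the sketch below builds one ideal band-pass impulse response as the difference of two sinc low-pass filters, as described in the SincNet paper. It is a plain-Python approximation and not the library's implementation: the helper names `sinc` and `sinc_bandpass` are hypothetical, no window function is applied, an odd `kernel_size` is assumed, and the way the cutoffs are derived from `min_low_hz` and `min_band_hz` is an assumption noted in the comments.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x) / x, with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(low_hz, band_hz, kernel_size, sample_rate=16000,
                  min_low_hz=50, min_band_hz=50):
    """Return one band-pass impulse response with `kernel_size` taps.

    Assumption: the cutoffs stay valid because the minimum values are
    added to the absolute value of the free parameters.
    """
    f1 = min_low_hz + abs(low_hz)          # lower cutoff in Hz
    f2 = f1 + min_band_hz + abs(band_hz)   # upper cutoff in Hz
    f2 = min(f2, sample_rate / 2)          # stay at or below Nyquist
    f1n, f2n = f1 / sample_rate, f2 / sample_rate  # normalized cutoffs
    half = (kernel_size - 1) // 2
    taps = []
    for n in range(-half, half + 1):
        # The difference of two low-pass sinc filters is a band-pass filter.
        high_lp = 2 * f2n * sinc(2 * math.pi * f2n * n)
        low_lp = 2 * f1n * sinc(2 * math.pi * f1n * n)
        taps.append(high_lp - low_lp)
    return taps


# Pass band from 300 Hz to 700 Hz, with 11 taps as in the example above.
taps = sinc_bandpass(low_hz=250, band_hz=350, kernel_size=11)
```

Only the two cutoff frequencies are free parameters per filter, which is why `min_low_hz` and `min_band_hz` are enough to keep every filter well formed.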

forward(x)[source]

Returns the output of the convolution.

Parameters: x (torch.Tensor (batch, time, channel)) – Input to convolve. 2d or 3d tensors are expected.
Returns: wx – The convolved outputs.
Return type: torch.Tensor

class speechbrain.nnet.CNN.Conv1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', groups=1, bias=True, padding_mode='reflect', skip_transpose=False, weight_norm=False, conv_init=None, default_padding=0)[source]

Bases: Module

This class implements 1d convolution.

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (int) – Kernel size of the convolutional filters.
input_shape (tuple) – The shape of the input. Alternatively use in_channels.
in_channels (int) – The number of input channels. Alternatively use input_shape.
stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.
dilation (int) – Dilation factor of the convolutional filters.
padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.
groups (int) – Number of blocked connections from input channels to output channels.
bias (bool) – Whether to add a bias term to convolution operation.
padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.
skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.
weight_norm (bool) – If True, use weight normalization, to be removed with self.remove_weight_norm() at inference
conv_init (str) – Weight initialization for the convolution network
default_padding (str or int) – This sets the default padding mode that will be used by the pytorch Conv1d backend.

Example

>>> inp_tensor = torch.rand([10, 40, 16])
>>> cnn_1d = Conv1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
... )
>>> out_tensor = cnn_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 8])
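
The effect of `stride`, `dilation`, and `padding` on the time axis can be checked with the standard output-length formula for convolutions. The helper below is a minimal sketch with a hypothetical name (`conv1d_out_len`); it takes the total number of padded samples as an argument rather than a padding mode, so it does not reproduce how the library chooses the padding.

```python
import math


def conv1d_out_len(l_in, kernel_size, stride=1, dilation=1, pad_total=0):
    """Output length of a 1d convolution, given the total padding added."""
    span = dilation * (kernel_size - 1) + 1  # input samples seen by one output
    return math.floor((l_in + pad_total - span) / stride) + 1


# "valid": no padding, so the output shrinks by kernel_size - 1 at stride 1.
valid_len = conv1d_out_len(40, kernel_size=5)
# "same" with stride 1: kernel_size - 1 padded samples restore the length.
same_len = conv1d_out_len(40, kernel_size=5, pad_total=4)
# Stride 2 without padding decimates the time axis.
strided_len = conv1d_out_len(40, kernel_size=5, stride=2)
```

With 40 input frames and `kernel_size=5`, the "same" case keeps 40 frames, which agrees with the output shape shown in the example above.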

forward(x)[source]

Returns the output of the convolution.

Parameters: x (torch.Tensor (batch, time, channel)) – Input to convolve. 2d or 3d tensors are expected.
Returns: wx – The convolved outputs.
Return type: torch.Tensor
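
To make the "causal" padding option more concrete, the following plain-Python sketch convolves a short list with zero padding on the left only. It assumes that causal padding means adding `(kernel_size - 1) * dilation` zeros before the signal; the function name `causal_conv1d` is hypothetical and the code works on lists, not tensors.

```python
def causal_conv1d(signal, kernel, dilation=1):
    """Causal (dilated) convolution of a 1d list with zero left-padding.

    Assumption: "causal" is realized by padding (kernel_size - 1) * dilation
    zeros on the left only, so output[t] never sees signal[t + 1:].
    """
    k = len(kernel)
    num_pad = (k - 1) * dilation
    padded = [0.0] * num_pad + list(signal)
    out = []
    for t in range(len(signal)):
        # kernel[-1] multiplies the current sample; earlier taps look back.
        acc = sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = causal_conv1d(x, kernel=[0.5, 0.25, 0.25])
# Changing a future sample leaves the earlier outputs untouched.
x_future = x[:3] + [100.0, 5.0]
y_future = causal_conv1d(x_future, kernel=[0.5, 0.25, 0.25])
```

The output has the same length as the input, and the first three values of `y` and `y_future` agree because they only depend on samples that were not modified.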

remove_weight_norm()[source]: Removes weight normalization at inference if used during training.
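
For readers unfamiliar with weight normalization, the sketch below shows the usual reparametrization, in which a weight vector is split into a direction `v` and a magnitude `g`. Removing the normalization for inference then amounts to computing the effective weight once and storing it as an ordinary parameter. The function name is hypothetical, and whether `remove_weight_norm()` performs exactly this folding internally is an assumption.

```python
import math


def effective_weight(v, g):
    """Effective weight w = g * v / ||v|| for one filter (flat list)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


v = [3.0, 4.0]   # direction parameter, ||v|| = 5
g = 2.0          # magnitude parameter
# Folding for inference: compute w once and use it as a plain weight.
w = effective_weight(v, g)
```

After folding, the norm of `w` equals `g`, so the layer computes the same output without recomputing the normalization on every forward pass.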

class speechbrain.nnet.CNN.Conv2d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=(1, 1), dilation=(1, 1), padding='same', groups=1, bias=True, padding_mode='reflect', max_norm=None, swap=False, skip_transpose=False, weight_norm=False, conv_init=None)[source]

Bases: Module

This class implements 2d convolution.

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.
input_shape (tuple) – The shape of the input. Alternatively use in_channels.
in_channels (int) – The number of input channels. Alternatively use input_shape.
stride (int) – Stride factor of the 2d convolutional filters over time and frequency axis.
dilation (int) – Dilation factor of the 2d convolutional filters over time and frequency axis.
padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is same as input shape. If “causal” then proper padding is inserted to simulate causal convolution on the first spatial dimension. (spatial dim 1 is dim 3 for both skip_transpose=False and skip_transpose=True)
groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.
bias (bool) – If True, the additive bias b is adopted.
padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.
max_norm (float) – kernel max-norm.
swap (bool) – If True, the convolution is done with the format (B, C, W, H). If False, the convolution is done with (B, H, W, C). Active only if skip_transpose is False.
skip_transpose (bool) – If False, uses batch x spatial.dim2 x spatial.dim1 x channel convention of speechbrain. If True, uses batch x channel x spatial.dim1 x spatial.dim2 convention.
weight_norm (bool) – If True, use weight normalization, to be removed with self.remove_weight_norm() at inference
conv_init (str) – Weight initialization for the convolution network

Example

>>> inp_tensor = torch.rand([10, 40, 16, 8])
>>> cnn_2d = Conv2d(
...     input_shape=inp_tensor.shape, out_channels=5, kernel_size=(7, 3)
... )
>>> out_tensor = cnn_2d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 16, 5])
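
The `max_norm` parameter is described only briefly. One common reading of a kernel max-norm constraint is that the L2 norm of each output filter is capped at `max_norm`, with larger filters rescaled. The sketch below illustrates that reading on flat lists; the helper `apply_max_norm` is hypothetical, and the exact norm and axis used by the library are assumptions.

```python
import math


def apply_max_norm(filters, max_norm):
    """Rescale each filter (flat list of weights) so its L2 norm <= max_norm."""
    clipped = []
    for w in filters:
        norm = math.sqrt(sum(x * x for x in w))
        # Filters already inside the ball are left unchanged.
        scale = max_norm / norm if norm > max_norm else 1.0
        clipped.append([x * scale for x in w])
    return clipped


filters = [[3.0, 4.0], [0.3, 0.4]]   # L2 norms 5.0 and 0.5
clipped = apply_max_norm(filters, max_norm=1.0)
```

Only the first filter is rescaled here, since the second one already satisfies the constraint.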

forward(x)[source]

Returns the output of the convolution.

Parameters: x (torch.Tensor (batch, time, channel)) – Input to convolve. 3d or 4d tensors are expected.
Returns: x – The output of the convolution.
Return type: torch.Tensor

remove_weight_norm()[source]: Removes weight normalization at inference if used during training.

class speechbrain.nnet.CNN.ConvTranspose1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding=0, output_padding=0, groups=1, bias=True, skip_transpose=False, weight_norm=False)[source]

Bases: Module

This class implements 1d transposed convolution with speechbrain. Transposed convolution is normally used to perform upsampling.

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (int) – Kernel size of the convolutional filters.
input_shape (tuple) – The shape of the input. Alternatively use in_channels.
in_channels (int) – The number of input channels. Alternatively use input_shape.
stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, upsampling in time is performed.
dilation (int) – Dilation factor of the convolutional filters.
padding (str or int) – To obtain a target output dimension, tune the kernel size and the padding together. The following options give some control over the padding and the resulting output size. If “valid”, no padding is applied. If “same”, the padding amount is inferred so that the output size is as close as possible to the input size. If “factor”, the padding amount is inferred so that the output size is as close as possible to input size * stride. For some kernel_size / stride combinations, “same” and “factor” cannot reach the exact target size, in which case the closest possible size is returned. If an integer value is given, it is used as a custom padding.
output_padding (int) – Additional size added to one side of the output shape.
groups (int) – Number of blocked connections from input channels to output channels. Default: 1
bias (bool) – If True, adds a learnable bias to the output
skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.
weight_norm (bool) – If True, use weight normalization, to be removed with self.remove_weight_norm() at inference

Example

>>> from speechbrain.nnet.CNN import Conv1d, ConvTranspose1d
>>> inp_tensor = torch.rand([10, 12, 40])  # [batch, time, fea]
>>> convtranspose_1d = ConvTranspose1d(
...     input_shape=inp_tensor.shape,
...     out_channels=8,
...     kernel_size=3,
...     stride=2,
... )
>>> out_tensor = convtranspose_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 25, 8])

>>> # Combination of Conv1d and ConvTranspose1d
>>> from speechbrain.nnet.CNN import Conv1d, ConvTranspose1d
>>> signal = torch.rand([1, 100])  # [batch, time]
>>> conv1d = Conv1d(
...     input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2
... )
>>> conv_out = conv1d(signal)
>>> conv_t = ConvTranspose1d(
...     input_shape=conv_out.shape,
...     out_channels=1,
...     kernel_size=3,
...     stride=2,
...     padding=1,
... )
>>> signal_rec = conv_t(conv_out, output_size=[100])
>>> signal_rec.shape
torch.Size([1, 100])

>>> signal = torch.rand([1, 115])  # [batch, time]
>>> conv_t = ConvTranspose1d(
...     input_shape=signal.shape,
...     out_channels=1,
...     kernel_size=3,
...     stride=2,
...     padding="same",
... )
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 115])

>>> signal = torch.rand([1, 115])  # [batch, time]
>>> conv_t = ConvTranspose1d(
...     input_shape=signal.shape,
...     out_channels=1,
...     kernel_size=7,
...     stride=2,
...     padding="valid",
... )
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 235])

>>> signal = torch.rand([1, 115])  # [batch, time]
>>> conv_t = ConvTranspose1d(
...     input_shape=signal.shape,
...     out_channels=1,
...     kernel_size=7,
...     stride=2,
...     padding="factor",
... )
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 231])

>>> signal = torch.rand([1, 115])  # [batch, time]
>>> conv_t = ConvTranspose1d(
...     input_shape=signal.shape,
...     out_channels=1,
...     kernel_size=3,
...     stride=2,
...     padding=10,
... )
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 211])
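
The output lengths in the examples above can be reproduced with the output-length formula documented for PyTorch's `ConvTranspose1d`. The helper below is a small sketch with a hypothetical name (`conv_transpose1d_out_len`); it takes an integer padding and does not model how the "same" and "factor" modes infer one.

```python
def conv_transpose1d_out_len(l_in, kernel_size, stride=1, padding=0,
                             dilation=1, output_padding=0):
    """Output length of a 1d transposed convolution (PyTorch convention)."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


# First example above: 12 frames upsampled with stride 2.
first_len = conv_transpose1d_out_len(12, kernel_size=3, stride=2)
# padding="valid" behaves like padding=0.
valid_len = conv_transpose1d_out_len(115, kernel_size=7, stride=2)
# An explicit integer padding trims samples from both ends.
custom_len = conv_transpose1d_out_len(115, kernel_size=3, stride=2, padding=10)
```

These three calls give 25, 235, and 211, matching the shapes printed in the corresponding examples.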

forward(x, output_size=None)[source]

Returns the output of the convolution.

Parameters:

x (torch.Tensor (batch, time, channel)) – Input to convolve. 2d or 3d tensors are expected.
output_size (int) – The size of the output

Returns:

x – The convolved output

Return type:

torch.Tensor

remove_weight_norm()[source]: Removes weight normalization at inference if used during training.

class speechbrain.nnet.CNN.DepthwiseSeparableConv1d(out_channels, kernel_size, input_shape, stride=1, dilation=1, padding='same', bias=True)[source]

Bases: Module

This class implements the depthwise separable 1d convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the result to the output channels.

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (int) – Kernel size of the convolutional filters.
input_shape (tuple) – Expected shape of the input.
stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.
dilation (int) – Dilation factor of the convolutional filters.
padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.
bias (bool) – If True, the additive bias b is adopted.

Example

>>> inp = torch.randn([8, 120, 40])
>>> conv = DepthwiseSeparableConv1d(256, 3, input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 256])
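
The two steps of a depthwise separable convolution can be written out on nested lists to show what each one does. This is an illustrative sketch only: `depthwise_separable_conv1d` is a hypothetical helper, it uses a channel-first layout and "valid" padding for brevity, and it has no bias term, unlike the class documented here.

```python
def depthwise_separable_conv1d(x, depthwise, pointwise):
    """Depthwise separable 1d convolution on nested lists ("valid" padding).

    x:         [channels_in][time]
    depthwise: [channels_in][kernel_size]   one kernel per input channel
    pointwise: [channels_out][channels_in]  kernel-size-1 mixing weights
    """
    k = len(depthwise[0])
    t_out = len(x[0]) - k + 1
    # Step 1: each channel is filtered independently with its own kernel.
    per_channel = [
        [sum(kern[j] * chan[t + j] for j in range(k)) for t in range(t_out)]
        for chan, kern in zip(x, depthwise)
    ]
    # Step 2: a point-wise convolution mixes the channels at each time step.
    return [
        [sum(w * per_channel[c][t] for c, w in enumerate(row))
         for t in range(t_out)]
        for row in pointwise
    ]


x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]   # 2 channels, 4 frames
depthwise = [[1.0, 1.0], [1.0, -1.0]]              # kernel_size = 2
pointwise = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 -> 3 channels
out = depthwise_separable_conv1d(x, depthwise, pointwise)
```

The first two output channels pass the filtered input channels through unchanged, and the third is their sum, which makes the separation between temporal filtering and channel mixing visible.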

forward(x)[source]

Returns the output of the convolution.

Parameters: x (torch.Tensor (batch, time, channel)) – Input to convolve. 3d tensors are expected.
Returns: The convolved outputs.

class speechbrain.nnet.CNN.DepthwiseSeparableConv2d(out_channels, kernel_size, input_shape, stride=(1, 1), dilation=(1, 1), padding='same', bias=True)[source]

Bases: Module

This class implements the depthwise separable 2d convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the result to the output channels.

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (int or tuple) – Kernel size of the convolutional filters.
input_shape (tuple) – Expected shape of the input tensors.
stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.
dilation (int) – Dilation factor of the convolutional filters.
padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.
bias (bool) – If True, the additive bias b is adopted.

Example

>>> inp = torch.randn([8, 120, 40, 1])
>>> conv = DepthwiseSeparableConv2d(256, (3, 3), input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 40, 256])
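
A common motivation for the depthwise separable factorization is the reduction in weights compared with a dense convolution. The sketch below counts weights for both variants, assuming one depthwise kernel per input channel and ignoring biases; both function names are hypothetical and the channel counts are arbitrary illustration values.

```python
def standard_conv2d_params(c_in, c_out, kernel_size):
    """Weights of a dense 2d convolution (bias excluded)."""
    kh, kw = kernel_size
    return c_in * c_out * kh * kw


def separable_conv2d_params(c_in, c_out, kernel_size):
    """Weights of a depthwise conv (one kernel per channel) plus a 1x1 conv."""
    kh, kw = kernel_size
    return c_in * kh * kw + c_in * c_out


dense = standard_conv2d_params(64, 256, (3, 3))
separable = separable_conv2d_params(64, 256, (3, 3))
```

For 64 input channels, 256 output channels, and a 3x3 kernel, the dense layer needs 147456 weights and the separable one 16960.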

forward(x)[source]

Returns the output of the convolution.

Parameters: x (torch.Tensor (batch, time, channel)) – Input to convolve. 3d or 4d tensors are expected.
Returns: out – The convolved output.
Return type: torch.Tensor

class speechbrain.nnet.CNN.GaborConv1d(out_channels, kernel_size, stride, input_shape=None, in_channels=None, padding='same', padding_mode='constant', sample_rate=16000, min_freq=60.0, max_freq=None, n_fft=512, normalize_energy=False, bias=False, sort_filters=False, use_legacy_complex=False, skip_transpose=False)[source]

Bases: Module

This class implements 1D Gabor Convolutions from

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc. of ICLR 2021 (https://arxiv.org/abs/2101.08596)

Parameters:

out_channels (int) – It is the number of output channels.
kernel_size (int) – Kernel size of the convolutional filters.
stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.
input_shape (tuple) – Expected shape of the input.
in_channels (int) – Number of channels expected in the input.
padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape.
padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.
sample_rate (int) – Sampling rate of the input signals.
min_freq (float) – Lowest possible frequency (in Hz) for a filter
max_freq (float) – Highest possible frequency (in Hz) for a filter
n_fft (int) – number of FFT bins for initialization
normalize_energy (bool) – whether to normalize energy at initialization. Default is False
bias (bool) – If True, the additive bias b is adopted.
sort_filters (bool) – whether to sort filters by center frequencies. Default is False
use_legacy_complex (bool) – If False, the torch.complex64 data type is used for the Gabor impulse responses. If True, computation is performed on two real-valued tensors.
skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

Example

>>> inp_tensor = torch.rand([10, 8000])
>>> # 401 corresponds to a window of 25 ms at 16000 Hz
>>> gabor_conv = GaborConv1d(40, kernel_size=401, stride=1, in_channels=1)
>>> out_tensor = gabor_conv(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 8000, 40])
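
A Gabor filter, as used in the LEAF paper, is a complex sinusoid modulated by a Gaussian envelope. The sketch below evaluates one such impulse response in plain Python. The function name and its arguments (`center_freq` in radians per sample, `sigma` in samples) are illustrative choices, not the library's parameter names, and the values used are arbitrary.

```python
import cmath
import math


def gabor_impulse_response(center_freq, sigma, kernel_size):
    """Complex Gabor filter: Gaussian envelope times a complex sinusoid.

    center_freq is in radians per sample and sigma in samples; both names
    are illustrative, not the library's parameter names.
    """
    half = kernel_size // 2
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    taps = []
    for t in range(-half, half + 1):
        envelope = norm * math.exp(-(t * t) / (2.0 * sigma * sigma))
        taps.append(envelope * cmath.exp(1j * center_freq * t))
    return taps


# 401 taps, as in the example above (about 25 ms at 16000 Hz).
taps = gabor_impulse_response(center_freq=math.pi / 4, sigma=20.0,
                              kernel_size=401)
```

Each filter is thus described by two quantities, a center frequency and a bandwidth, which is what makes the layer compact compared with an unconstrained convolution of the same kernel size.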

forward(x)[source]

Returns the output of the Gabor convolution.

Parameters: x (torch.Tensor (batch, time, channel)) – Input to convolve.
Returns: x – The output of the Gabor convolution.
Return type: torch.Tensor

speechbrain.nnet.CNN.get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int)[source]

This function computes the number of elements to add for zero-padding.

Parameters:

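L_in (int) – Length of the input along the convolved dimension.
stride (int) – Stride factor of the convolution.
kernel_size (int) – Kernel size of the convolution.
dilation (int) – Dilation factor of the convolution.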

Returns:

padding – The size of the padding to be added

Return type:

int
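
As a rough guide to the kind of value involved, the sketch below computes the per-side zero-padding that keeps the length unchanged for a stride-1 convolution. It assumes that the samples lost by an unpadded convolution are split evenly between both sides; the helper `same_padding_per_side` is hypothetical and is not the function documented here, whose handling of strides greater than 1 is not covered.

```python
def same_padding_per_side(kernel_size, dilation=1):
    """Zero-padding per side that keeps the length at stride 1.

    An unpadded convolution loses dilation * (kernel_size - 1) samples;
    half of that is added to each side (exact for odd kernel sizes).
    """
    lost = dilation * (kernel_size - 1)
    return lost // 2


pad_k5 = same_padding_per_side(5)                 # 2 per side
pad_k3_d4 = same_padding_per_side(3, dilation=4)  # 4 per side
```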

speechbrain.nnet.CNN.get_padding_elem_transposed(L_out: int, L_in: int, stride: int, kernel_size: int, dilation: int, output_padding: int)[source]

This function computes the required padding size for transposed convolution

Parameters:

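L_out (int) – Desired length of the output.
L_in (int) – Length of the input.
stride (int) – Stride factor of the transposed convolution.
kernel_size (int) – Kernel size of the transposed convolution.
dilation (int) – Dilation factor of the transposed convolution.
output_padding (int) – Additional size added to one side of the output.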

Returns:

padding – The size of the padding to be applied

Return type:

int
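
One way to obtain such a padding is to solve the PyTorch `ConvTranspose1d` output-length formula for `padding` and truncate the result to an integer. The sketch below does this with hypothetical helpers (`transposed_padding_for` and `out_len`); that the library function follows the same derivation is an assumption, although the resulting lengths agree with the `ConvTranspose1d` examples for `padding="same"` and `padding="factor"`.

```python
def out_len(l_in, stride, kernel_size, padding, dilation=1, output_padding=0):
    """Output length of a 1d transposed convolution (PyTorch convention)."""
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


def transposed_padding_for(l_out, l_in, stride, kernel_size,
                           dilation=1, output_padding=0):
    """Padding that brings a transposed convolution closest to l_out.

    Obtained by solving the output-length formula for `padding` and
    truncating to an integer, so the target may be missed by one sample.
    """
    free_len = out_len(l_in, stride, kernel_size, 0, dilation, output_padding)
    return int((free_len - l_out) / 2)


# Target equal to the input length (the "same" case).
pad_same = transposed_padding_for(115, 115, stride=2, kernel_size=3)
len_same = out_len(115, 2, 3, pad_same)
# Target equal to input length * stride (the "factor" case).
pad_factor = transposed_padding_for(230, 115, stride=2, kernel_size=7)
len_factor = out_len(115, 2, 7, pad_factor)
```

In the second case the exact target of 230 cannot be reached with an integer padding, so the closest achievable length, 231, is obtained instead.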