speechbrain.nnet.CNN module

Library implementing convolutional neural networks.

Authors
  • Mirco Ravanelli 2020

  • Jianyuan Zhong 2020

  • Cem Subakan 2021

Summary

Classes:

Conv1d

This class implements 1d convolution.

Conv2d

This class implements 2d convolution.

ConvTranspose1d

This class implements 1d transposed convolution with speechbrain.

DepthwiseSeparableConv1d

This class implements the depthwise separable convolution.

DepthwiseSeparableConv2d

This class implements the depthwise separable convolution.

SincConv

This class implements SincConv (SincNet).

Functions:

get_padding_elem

This function computes the number of elements to add for zero-padding.

get_padding_elem_transposed

This function computes the required padding size for transposed convolution.

Reference

class speechbrain.nnet.CNN.SincConv(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', padding_mode='reflect', sample_rate=16000, min_low_hz=50, min_band_hz=50)[source]

Bases: torch.nn.modules.module.Module

This class implements SincConv (SincNet).

M. Ravanelli, Y. Bengio, “Speaker Recognition from raw waveform with SincNet”, in Proc. of SLT 2018 (https://arxiv.org/abs/1808.00158)

Parameters
  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • sample_rate (int) – Sampling rate of the input signals. It is only used for sinc_conv.

  • min_low_hz (float) – Lowest possible frequency (in Hz) for a filter. It is only used for sinc_conv.

  • min_band_hz (float) – Lowest possible value (in Hz) for a filter bandwidth.

Example

>>> inp_tensor = torch.rand([10, 16000])
>>> conv = SincConv(input_shape=inp_tensor.shape, out_channels=25, kernel_size=11)
>>> out_tensor = conv(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 16000, 25])
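
Each output channel applies a learned band-pass filter. As a pointer to the underlying math, here is a minimal sketch of the parametrized kernel from the SincNet paper (an illustration only, not the class's internals; f_low and f_band stand in for the learned per-filter parameters that min_low_hz and min_band_hz constrain):

>>> def sinc_bandpass(f_low, f_band, kernel_size, sample_rate=16000):
...     # Normalized low and high cutoff frequencies of the band-pass filter.
...     f1 = f_low / sample_rate
...     f2 = (f_low + f_band) / sample_rate
...     n = torch.arange(kernel_size) - (kernel_size - 1) / 2
...     # g[n] = 2*f2*sinc(2*f2*n) - 2*f1*sinc(2*f1*n), with torch.sinc(x) = sin(pi*x)/(pi*x)
...     return 2 * f2 * torch.sinc(2 * f2 * n) - 2 * f1 * torch.sinc(2 * f1 * n)
>>> sinc_bandpass(50.0, 200.0, kernel_size=11).shape
torch.Size([11])
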
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.Conv1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding='same', groups=1, bias=True, padding_mode='reflect', skip_transpose=False)[source]

Bases: torch.nn.modules.module.Module

This class implements 1d convolution.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

Example

>>> inp_tensor = torch.rand([10, 40, 16])
>>> cnn_1d = Conv1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5
... )
>>> out_tensor = cnn_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 8])
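
Under the usual causal convention (left-padding by dilation * (kernel_size - 1) so each output frame depends only on current and past samples), 'causal' padding should also preserve the time dimension; a sketch continuing the example above:

>>> cnn_causal = Conv1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=5, padding='causal'
... )
>>> cnn_causal(inp_tensor).shape
torch.Size([10, 40, 8])
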
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.Conv2d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=(1, 1), dilation=(1, 1), padding='same', groups=1, bias=True, padding_mode='reflect')[source]

Bases: torch.nn.modules.module.Module

This class implements 2d convolution.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (tuple) – Stride factor of the 2d convolutional filters over time and frequency axis.

  • dilation (tuple) – Dilation factor of the 2d convolutional filters over time and frequency axis.

  • padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, output shape is same as input shape.

  • padding_mode (str) – This flag specifies the type of padding. See torch.nn documentation for more information.

  • groups (int) – This option specifies the convolutional groups. See torch.nn documentation for more information.

  • bias (bool) – If True, the additive bias b is adopted.

Example

>>> inp_tensor = torch.rand([10, 40, 16, 8])
>>> cnn_2d = Conv2d(
...     input_shape=inp_tensor.shape, out_channels=5, kernel_size=(7, 3)
... )
>>> out_tensor = cnn_2d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 16, 5])
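
With 'valid' padding the output follows the usual no-padding arithmetic, shrinking each convolved axis by dilation * (kernel_size - 1) at stride 1; a sketch continuing the example above (assuming 'valid' simply skips padding, as documented):

>>> cnn_2d_valid = Conv2d(
...     input_shape=inp_tensor.shape, out_channels=5, kernel_size=(7, 3), padding='valid'
... )
>>> cnn_2d_valid(inp_tensor).shape
torch.Size([10, 34, 14, 5])
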
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, freq, channel)) – input to convolve. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.CNN.ConvTranspose1d(out_channels, kernel_size, input_shape=None, in_channels=None, stride=1, dilation=1, padding=0, output_padding=0, groups=1, bias=True, skip_transpose=False)[source]

Bases: torch.nn.modules.module.Module

This class implements 1d transposed convolution with speechbrain. Transposed convolution is normally used to perform upsampling.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – The shape of the input. Alternatively use in_channels.

  • in_channels (int) – The number of input channels. Alternatively use input_shape.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, upsampling in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str or int) – To get the target dimension in output, we suggest tuning the kernel size and the padding together. The following options give some control over the padding and the corresponding output dimensionality: if “valid”, no padding is applied; if “same”, the padding amount is inferred so that the output size is as close as possible to the input size (for some kernel_size/stride combinations the exact size cannot be obtained, and the closest possible size is returned); if “factor”, the padding amount is inferred so that the output size is as close as possible to input_size * stride (again, the closest possible size is returned when the exact one cannot be obtained); if an integer value is entered, a custom padding amount is used.

  • output_padding (int) – Additional size added to one side of the output shape.

  • groups (int) – Number of blocked connections from input channels to output channels. Default: 1

  • bias (bool) – If True, adds a learnable bias to the output

  • skip_transpose (bool) – If False, uses batch x time x channel convention of speechbrain. If True, uses batch x channel x time convention.

Example

>>> from speechbrain.nnet.CNN import Conv1d, ConvTranspose1d
>>> inp_tensor = torch.rand([10, 12, 40]) #[batch, time, fea]
>>> convtranspose_1d = ConvTranspose1d(
...     input_shape=inp_tensor.shape, out_channels=8, kernel_size=3, stride=2
... )
>>> out_tensor = convtranspose_1d(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 25, 8])
>>> # Combination of Conv1d and ConvTranspose1d
>>> from speechbrain.nnet.CNN import Conv1d, ConvTranspose1d
>>> signal = torch.rand([1,100]) #[batch, time]
>>> conv1d = Conv1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2)
>>> conv_out = conv1d(signal)
>>> conv_t = ConvTranspose1d(input_shape=conv_out.shape, out_channels=1, kernel_size=3, stride=2, padding=1)
>>> signal_rec = conv_t(conv_out, output_size=[100])
>>> signal_rec.shape
torch.Size([1, 100])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2, padding='same')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 115])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=7, stride=2, padding='valid')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 235])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=7, stride=2, padding='factor')
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 231])
>>> signal = torch.rand([1,115]) #[batch, time]
>>> conv_t = ConvTranspose1d(input_shape=signal.shape, out_channels=1, kernel_size=3, stride=2, padding=10)
>>> signal_rec = conv_t(signal)
>>> signal_rec.shape
torch.Size([1, 211])
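
These output lengths follow the standard transposed-convolution formula (assuming the integer-padding case maps directly onto torch.nn.ConvTranspose1d, which the examples above are consistent with):

L_out = (L_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1

For the last example: (115 - 1) * 2 - 2 * 10 + 1 * (3 - 1) + 0 + 1 = 228 - 20 + 2 + 1 = 211, i.e. torch.Size([1, 211]).
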
forward(x, output_size=None)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 2d or 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.DepthwiseSeparableConv1d(out_channels, kernel_size, input_shape, stride=1, dilation=1, padding='same', bias=True)[source]

Bases: torch.nn.modules.module.Module

This class implements the depthwise separable convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the input to the output.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (int) – Kernel size of the convolutional filters.

  • input_shape (tuple) – Expected shape of the input.

  • stride (int) – Stride factor of the convolutional filters. When the stride factor > 1, a decimation in time is performed.

  • dilation (int) – Dilation factor of the convolutional filters.

  • padding (str) – (same, valid, causal). If “valid”, no padding is performed. If “same” and stride is 1, output shape is the same as the input shape. “causal” results in causal (dilated) convolutions.

  • bias (bool) – If True, the additive bias b is adopted.

Example

>>> inp = torch.randn([8, 120, 40])
>>> conv = DepthwiseSeparableConv1d(256, 3, input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 256])
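
To make the channel-wise/point-wise split concrete, here is a minimal plain-PyTorch sketch of the same factorization (an illustration, not the class's internals; note that torch.nn modules use the batch x channel x time layout):

>>> import torch.nn as nn
>>> depthwise = nn.Conv1d(40, 40, kernel_size=3, groups=40, padding=1)  # one filter per input channel
>>> pointwise = nn.Conv1d(40, 256, kernel_size=1)  # 1x1 projection across channels
>>> x = torch.randn(8, 40, 120)  # [batch, channel, time]
>>> pointwise(depthwise(x)).shape
torch.Size([8, 256, 120])
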
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, channel)) – input to convolve. 3d tensors are expected.

training: bool
class speechbrain.nnet.CNN.DepthwiseSeparableConv2d(out_channels, kernel_size, input_shape, stride=(1, 1), dilation=(1, 1), padding='same', bias=True)[source]

Bases: torch.nn.modules.module.Module

This class implements the depthwise separable convolution.

First, a channel-wise convolution is applied to the input. Then, a point-wise convolution projects the input to the output.

Parameters
  • out_channels (int) – It is the number of output channels.

  • kernel_size (tuple) – Kernel size of the 2d convolutional filters over time and frequency axis.

  • input_shape (tuple) – Expected shape of the input.

  • stride (tuple) – Stride factor of the 2d convolutional filters over time and frequency axis.

  • dilation (tuple) – Dilation factor of the 2d convolutional filters over time and frequency axis.

  • padding (str) – (same, valid). If “valid”, no padding is performed. If “same” and stride is 1, output shape is same as input shape.

  • bias (bool) – If True, the additive bias b is adopted.

Example

>>> inp = torch.randn([8, 120, 40, 1])
>>> conv = DepthwiseSeparableConv2d(256, (3, 3), input_shape=inp.shape)
>>> out = conv(inp)
>>> out.shape
torch.Size([8, 120, 40, 256])
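
The appeal of the factorization is parameter count: a standard 2d convolution from C_in to C_out channels with a k x k kernel needs C_in * C_out * k * k weights, whereas the depthwise-separable version needs C_in * k * k (channel-wise) + C_in * C_out (point-wise). For instance, with C_in = 64, C_out = 256 and k = 3, that is 147456 weights versus 16960, roughly an 8.7x reduction (ignoring biases).
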
forward(x)[source]

Returns the output of the convolution.

Parameters

x (torch.Tensor (batch, time, freq, channel)) – input to convolve. 4d tensors are expected.

training: bool
speechbrain.nnet.CNN.get_padding_elem(L_in: int, stride: int, kernel_size: int, dilation: int)[source]

This function computes the number of elements to add for zero-padding.

Parameters
  • L_in (int) – Length of the input (time axis).

  • stride (int) – Stride of the convolution.

  • kernel_size (int) – Kernel size of the convolution.

  • dilation (int) – Dilation of the convolution.
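
As a rough illustration of the arithmetic (a sketch of the stride-1 'same'-padding reasoning, not the function's exact implementation): with no padding, a convolution shrinks the time axis by dilation * (kernel_size - 1) elements, so 'same' padding has to add that many elements back in total.

>>> # Hypothetical check based on the standard convolution length formula
>>> L_in, kernel_size, dilation = 100, 5, 1
>>> L_out = L_in - dilation * (kernel_size - 1)  # stride 1, no padding
>>> L_in - L_out  # total number of elements 'same' padding must add
4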

speechbrain.nnet.CNN.get_padding_elem_transposed(L_out: int, L_in: int, stride: int, kernel_size: int, dilation: int, output_padding: int)[source]

This function computes the required padding size for transposed convolution.

Parameters
  • L_out (int) – Target length of the output (time axis).

  • L_in (int) – Length of the input (time axis).

  • stride (int) – Stride of the transposed convolution.

  • kernel_size (int) – Kernel size of the transposed convolution.

  • dilation (int) – Dilation of the transposed convolution.

  • output_padding (int) – Additional size added to one side of the output.
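
Assuming the standard transposed-convolution length relation (the same formula noted for ConvTranspose1d above), the required symmetric padding can be solved for directly; a sketch with hypothetical sizes:

>>> # Solve L_out = (L_in - 1)*stride - 2*padding + dilation*(kernel_size - 1) + output_padding + 1
>>> L_out, L_in, stride, kernel_size, dilation, output_padding = 100, 50, 2, 4, 1, 0
>>> ((L_in - 1) * stride + dilation * (kernel_size - 1) + output_padding + 1 - L_out) // 2
1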