speechbrain.lobes.models.convolution module

This module assembles a convolution (depthwise) encoder with or without residual connections.

Authors
  • Jianyuan Zhong 2020

  • Titouan Parcollet 2023

  • Gianfranco Dumoulin Bertucci 2025

Summary

Classes:

ConvBlock

An implementation of a convolution block with 1d or 2d (depthwise) convolutions.

ConvolutionFrontEnd

This module assembles a convolution (depthwise) encoder with or without residual connections.

ConvolutionalSpatialGatingUnit

This module implements the Convolutional Spatial Gating Unit (CSGU) as defined in "Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding".

Reference

class speechbrain.lobes.models.convolution.ConvolutionalSpatialGatingUnit(input_size: int, kernel_size: int = 31, dropout: float = 0.0, use_linear_after_conv: bool = False, activation: ~typing.Type[~torch.nn.modules.module.Module] = <class 'torch.nn.modules.linear.Identity'>)[source]

Bases: Module

This module implements the Convolutional Spatial Gating Unit (CSGU) as defined in "Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding".

The code is heavily inspired by the original ESPNet implementation.

Parameters:
  • input_size (int) – Size of the feature (channel) dimension.

  • kernel_size (int, optional (default=31)) – Size of the kernel.

  • dropout (float, optional (default=0.0)) – Dropout rate to be applied at the output.

  • use_linear_after_conv (bool, optional (default=False)) – If True, applies a linear transformation of size input_size//2 after the convolution.

  • activation (Type[torch.nn.Module], optional (default=torch.nn.Identity)) – Activation function to use on the gate.

Example

>>> x = torch.rand((8, 30, 10))
>>> conv = ConvolutionalSpatialGatingUnit(input_size=x.shape[-1])
>>> out = conv(x)
>>> out.shape
torch.Size([8, 30, 5])

forward(x)[source]

Parameters:

x (torch.Tensor) – Input tensor of shape (B, T, D), i.e. (batch, time, features).

Returns:

out – The gated output; as in the example above, the feature dimension is halved, giving shape (B, T, D // 2).

Return type:

torch.Tensor
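
The gating idea can be sketched in plain PyTorch. This is a minimal illustration of the mechanism described in the Branchformer paper (split the channels in half, normalize and depthwise-convolve one half, then use it to gate the other), not the SpeechBrain implementation. The class name `CSGUSketch` is hypothetical, and the optional linear layer and gate activation are left out for brevity.

```python
import torch
from torch import nn


class CSGUSketch(nn.Module):
    """Hypothetical, simplified CSGU: not the SpeechBrain class."""

    def __init__(self, input_size: int, kernel_size: int = 31, dropout: float = 0.0):
        super().__init__()
        half = input_size // 2
        self.norm = nn.LayerNorm(half)
        # groups=half makes the convolution depthwise (one filter per channel).
        # An odd kernel with this padding keeps the time length unchanged.
        self.conv = nn.Conv1d(
            half, half, kernel_size, padding=(kernel_size - 1) // 2, groups=half
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) -> two halves of shape (B, T, D // 2)
        x_r, x_g = x.chunk(2, dim=-1)
        x_g = self.norm(x_g)
        # Conv1d expects (B, C, T), so swap time and channel axes around it.
        x_g = self.conv(x_g.transpose(1, 2)).transpose(1, 2)
        # Element-wise gating, then dropout on the output.
        return self.dropout(x_r * x_g)


x = torch.rand((8, 30, 10))
out = CSGUSketch(input_size=x.shape[-1])(x)
print(out.shape)  # torch.Size([8, 30, 5])
```

Because one half of the channels only serves as the gate, the output keeps `input_size // 2` features, which matches the shape shown in the example above.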

class speechbrain.lobes.models.convolution.ConvolutionFrontEnd(input_shape: ~typing.Iterable, num_blocks: int = 3, num_layers_per_block: int = 5, out_channels: ~typing.List[int] = [128, 256, 512], kernel_sizes: ~typing.List[int] = [3, 3, 3], strides: ~typing.List[int] = [1, 2, 2], dilations: ~typing.List[int] = [1, 1, 1], residuals: ~typing.List[bool] = [True, True, True], conv_module: ~typing.Type[~torch.nn.modules.module.Module] = <class 'speechbrain.nnet.CNN.Conv2d'>, activation: ~typing.Callable = <class 'torch.nn.modules.activation.LeakyReLU'>, norm: ~typing.Type[~torch.nn.modules.module.Module] | None = <class 'speechbrain.nnet.normalization.LayerNorm'>, dropout: float = 0.1, conv_bias: bool = True, padding: ~typing.Literal['same', 'valid', 'causal'] = 'same', conv_init: str | None = None)[source]

Bases: Sequential

This is a module to ensemble a convolution (depthwise) encoder with or without residual connection.

Parameters:
  • input_shape (Iterable) – Expected shape of the input tensor.

  • num_blocks (int, optional (default=3)) – Number of blocks.

  • num_layers_per_block (int, optional (default=5)) – Number of convolution layers for each block.

  • out_channels (List[int], optional (default=[128, 256, 512])) – Number of output channels for each block.

  • kernel_sizes (List[int], optional (default=[3, 3, 3])) – Kernel size of convolution blocks.

  • strides (List[int], optional (default=[1, 2, 2])) – Striding factor for each block, applied at the last layer.

  • dilations (List[int], optional (default=[1, 1, 1])) – Dilation factor for each block.

  • residuals (List[bool], optional (default=[True, True, True])) – Whether to apply residual connection at each block.

  • conv_module (Type[torch.nn.Module], optional (default=sb.nnet.Conv2d)) – Class to use for constructing conv layers.

  • activation (Callable, optional (default=torch.nn.LeakyReLU)) – Activation function for each block.

  • norm (Optional[Type[torch.nn.Module]] (default=LayerNorm)) – Normalization to regularize the model.

  • dropout (float, optional (default=0.1)) – Dropout probability.

  • conv_bias (bool, optional (default=True)) – Whether to add a bias term to convolutional layers.

  • padding (Literal["same", "valid", "causal"], optional (default="same")) – Type of padding to apply.

  • conv_init (Optional[str], optional (default=None=zeros)) – Type of initialization to use for conv layers.

Example

>>> x = torch.rand((8, 30, 10))
>>> conv = ConvolutionFrontEnd(input_shape=x.shape)
>>> out = conv(x)
>>> out.shape
torch.Size([8, 8, 3, 512])

get_filter_properties() → FilterProperties[source]

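
The output shape in the example above follows from the default strides. The sketch below reproduces that arithmetic in pure Python, assuming that with "same" padding each strided layer maps a length L to ceil(L / stride) on both the time and feature axes, and that the last dimension equals the final block's out_channels. The helper name `front_end_output_shape` is hypothetical and not part of SpeechBrain.

```python
import math


def front_end_output_shape(input_shape, out_channels=(128, 256, 512), strides=(1, 2, 2)):
    """Estimate the output shape for a (batch, time, features) input.

    Assumes "same" padding, so each block only shrinks the time and
    feature axes by its stride: length -> ceil(length / stride).
    """
    batch, time, feats = input_shape
    for stride in strides:
        time = math.ceil(time / stride)
        feats = math.ceil(feats / stride)
    # The channel axis is appended last and set by the final block.
    return (batch, time, feats, out_channels[-1])


print(front_end_output_shape((8, 30, 10)))  # (8, 8, 3, 512)
```

With the defaults, time goes 30 → 30 → 15 → 8 and features go 10 → 10 → 5 → 3, which agrees with the documented `torch.Size([8, 8, 3, 512])`.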
class speechbrain.lobes.models.convolution.ConvBlock(num_layers: int, out_channels: int, input_shape: ~typing.Iterable, kernel_size: int = 3, stride: int = 1, dilation: int = 1, residual: bool = False, conv_module: ~typing.Type[~torch.nn.modules.module.Module] = <class 'speechbrain.nnet.CNN.Conv2d'>, activation: ~typing.Callable = <class 'torch.nn.modules.activation.LeakyReLU'>, norm: ~typing.Type[~torch.nn.modules.module.Module] | None = None, dropout: float = 0.1, conv_bias: bool = True, padding: ~typing.Literal['same', 'valid', 'causal'] = 'same', conv_init: str | None = None)[source]

Bases: Module

An implementation of convolution block with 1d or 2d convolutions (depthwise).

Parameters:
  • num_layers (int) – Number of depthwise convolution layers for this block.

  • out_channels (int) – Number of output channels of this model.

  • input_shape (Iterable) – Expected shape of the input tensor.

  • kernel_size (int, optional (default=3)) – Kernel size of convolution layers.

  • stride (int, optional (default=1)) – Striding factor for this block.

  • dilation (int, optional (default=1)) – Dilation factor.

  • residual (bool, optional (default=False)) – Add a residual connection if True.

  • conv_module (Type[torch.nn.Module], optional (default=sb.nnet.Conv2d)) – Class to use when constructing conv layers.

  • activation (Callable, optional (default=torch.nn.LeakyReLU)) – Activation function for this block.

  • norm (Optional[Type[torch.nn.Module]] (default=None)) – Normalization to regularize the model.

  • dropout (float, optional (default=0.1)) – Rate to zero outputs at.

  • conv_bias (bool, optional (default=True)) – Add a bias term to conv layers.

  • padding (Literal["same", "valid", "causal"], optional (default="same")) – The type of padding to add.

  • conv_init (Optional[str], optional (default=None=zeros)) – Type of initialization to use for conv layers.

Example

>>> x = torch.rand((8, 30, 10))
>>> conv = ConvBlock(2, 16, input_shape=x.shape)
>>> out = conv(x)
>>> out.shape
torch.Size([8, 30, 10, 16])

forward(x)[source]

Processes the input tensor x and returns an output tensor.

get_filter_properties() → FilterProperties[source]
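
To see how kernel_size, stride, dilation and padding interact along one axis, the following pure-Python sketch applies the standard convolution length formula layer by layer. It assumes that "same" and "causal" both add dilation * (kernel_size - 1) padding elements in total (differing only in which side they pad), that "valid" adds none, and that the stride is applied at the last layer of the block only, as stated for ConvolutionFrontEnd. The helpers `conv_out_len` and `block_out_len` are hypothetical and not part of SpeechBrain.

```python
def conv_out_len(length, kernel_size=3, stride=1, dilation=1, padding="same"):
    """Output length of one conv layer along a single axis."""
    span = dilation * (kernel_size - 1)  # extent covered beyond the first tap
    if padding == "valid":
        padded = length
    elif padding in ("same", "causal"):
        # Assumption: total padding equals the kernel span in both modes.
        padded = length + span
    else:
        raise ValueError(f"unknown padding: {padding!r}")
    return (padded - span - 1) // stride + 1


def block_out_len(length, num_layers, kernel_size=3, stride=1, dilation=1, padding="same"):
    """Chain num_layers conv layers, striding only at the last one (assumed)."""
    for i in range(num_layers):
        layer_stride = stride if i == num_layers - 1 else 1
        length = conv_out_len(length, kernel_size, layer_stride, dilation, padding)
    return length


print(block_out_len(30, num_layers=2))                   # 30
print(block_out_len(30, num_layers=2, stride=2))         # 15
print(block_out_len(30, num_layers=2, padding="valid"))  # 26
```

Under these assumptions, "same" and "causal" padding preserve the length when stride is 1, while "valid" loses dilation * (kernel_size - 1) elements per layer.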