speechbrain.nnet.normalization module

Library implementing normalization.

Authors
  • Mirco Ravanelli 2020

  • Guillermo Cámbara 2021

  • Sarthak Yadav 2022

Summary

Classes:

BatchNorm1d

Applies 1d batch normalization to the input tensor.

BatchNorm2d

Applies 2d batch normalization to the input tensor.

ExponentialMovingAverage

Applies a learnable exponential moving average, as required by the learnable PCEN layer.

GroupNorm

Applies group normalization to the input tensor.

InstanceNorm1d

Applies 1d instance normalization to the input tensor.

InstanceNorm2d

Applies 2d instance normalization to the input tensor.

LayerNorm

Applies layer normalization to the input tensor.

PCEN

This class implements a learnable Per-channel energy normalization (PCEN) layer, supporting both the original PCEN as specified in [1] and sPCEN as specified in [2].

Reference

class speechbrain.nnet.normalization.BatchNorm1d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, combine_batch_time=False, skip_transpose=False)[source]

Bases: Module

Applies 1d batch normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • affine (bool) – When set to True, the affine parameters are learned.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

  • combine_batch_time (bool) – When True, it combines the batch and time axes.

Example

>>> input = torch.randn(100, 10)
>>> norm = BatchNorm1d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10])
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, [channels])) – input to normalize. 2d or 3d tensors are expected; 4d tensors can be used when combine_batch_time=True.
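
As a minimal sketch of the 4d case (an illustration based on the parameter descriptions above, not one of the module's original examples), a 4d input can be normalized by combining the batch and time axes:

>>> input = torch.randn(100, 10, 8, 8)
>>> norm = BatchNorm1d(input_shape=input.shape, combine_batch_time=True)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 8, 8])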

training: bool
class speechbrain.nnet.normalization.BatchNorm2d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)[source]

Bases: Module

Applies 2d batch normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • affine (bool) – When set to True, the affine parameters are learned.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

Example

>>> input = torch.randn(100, 10, 5, 20)
>>> norm = BatchNorm2d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 5, 20])
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channel1, channel2)) – input to normalize. 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.LayerNorm(input_size=None, input_shape=None, eps=1e-05, elementwise_affine=True)[source]

Bases: Module

Applies layer normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • elementwise_affine (bool) – If True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).

Example

>>> input = torch.randn(100, 101, 128)
>>> norm = LayerNorm(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])
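
A minimal sketch using input_size instead of input_shape (assuming input_size sets the normalized last dimension); the output shape is unchanged:

>>> input = torch.randn(100, 101, 128)
>>> norm = LayerNorm(input_size=128)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])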
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.InstanceNorm1d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, track_running_stats=True, affine=False)[source]

Bases: Module

Applies 1d instance normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

  • affine (bool) – When set to True, this module has learnable affine parameters, initialized the same way as for batch normalization. Default: False.

Example

>>> input = torch.randn(100, 10, 20)
>>> norm = InstanceNorm1d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 20])
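
A minimal sketch with learnable affine parameters enabled (affine=True, which defaults to False); the output shape is unchanged:

>>> input = torch.randn(100, 10, 20)
>>> norm = InstanceNorm1d(input_shape=input.shape, affine=True)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 20])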
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize. 3d tensors are expected.

training: bool
class speechbrain.nnet.normalization.InstanceNorm2d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, track_running_stats=True, affine=False)[source]

Bases: Module

Applies 2d instance normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

  • affine (bool) – When set to True, this module has learnable affine parameters, initialized the same way as for batch normalization. Default: False.

Example

>>> input = torch.randn(100, 10, 20, 2)
>>> norm = InstanceNorm2d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 20, 2])
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channel1, channel2)) – input to normalize. 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.GroupNorm(input_shape=None, input_size=None, num_groups=None, eps=1e-05, affine=True)[source]

Bases: Module

Applies group normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • num_groups (int) – Number of groups to separate the channels into.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • affine (bool) – When set to True, this module has learnable per-channel affine parameters initialized to ones (for weights) and zeros (for biases).

Example

>>> input = torch.randn(100, 101, 128)
>>> norm = GroupNorm(input_size=128, num_groups=128)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])
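
A minimal sketch with fewer groups than channels (num_groups must divide the channel dimension); the output shape is unchanged:

>>> input = torch.randn(100, 101, 128)
>>> norm = GroupNorm(input_size=128, num_groups=4)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])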
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.ExponentialMovingAverage(input_size: int, coeff_init: float = 0.04, per_channel: bool = False, trainable: bool = True, skip_transpose: bool = False)[source]

Bases: Module

Applies a learnable exponential moving average, as required by the learnable PCEN layer.

Parameters:
  • input_size (int) – The expected size of the input.

  • coeff_init (float) – Initial smoothing coefficient value

  • per_channel (bool) – Controls whether the smoothing coefficients are learned independently for every input channel.

  • trainable (bool) – Whether to learn the PCEN parameters or keep them fixed.

  • skip_transpose (bool) – If False, uses the batch x time x channel convention of SpeechBrain. If True, uses the batch x channel x time convention.

Example

>>> inp_tensor = torch.rand([10, 50, 40])
>>> pcen = ExponentialMovingAverage(40)
>>> out_tensor = pcen(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
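
A minimal sketch with skip_transpose=True (assuming the input is already laid out as batch x channel x time); the output shape matches the input:

>>> inp_tensor = torch.rand([10, 40, 50])
>>> ema = ExponentialMovingAverage(40, skip_transpose=True)
>>> out_tensor = ema(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 50])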
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize.

training: bool
class speechbrain.nnet.normalization.PCEN(input_size, alpha: float = 0.96, smooth_coef: float = 0.04, delta: float = 2.0, root: float = 2.0, floor: float = 1e-12, trainable: bool = True, per_channel_smooth_coef: bool = True, skip_transpose: bool = False)[source]

Bases: Module

This class implements a learnable Per-channel energy normalization (PCEN) layer, supporting both the original PCEN as specified in [1] and sPCEN as specified in [2].

[1] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous, “Trainable Frontend For Robust and Far-Field Keyword Spotting”, in Proc of ICASSP 2017 (https://arxiv.org/abs/1607.05666)

[2] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc of ICLR 2021 (https://arxiv.org/abs/2101.08596)

The default argument values correspond with those used by [2].

Parameters:
  • input_size (int) – The expected size of the input.

  • alpha (float) – Specifies the alpha coefficient for PCEN.

  • smooth_coef (float) – Specifies the smoothing coefficient for PCEN.

  • delta (float) – Specifies the delta coefficient for PCEN.

  • root (float) – Specifies the root coefficient for PCEN.

  • floor (float) – Specifies the floor coefficient for PCEN.

  • trainable (bool) – Whether to learn the PCEN parameters or keep them fixed.

  • per_channel_smooth_coef (bool) – Whether to learn an independent smoothing coefficient for every channel. When True, this essentially corresponds to sPCEN from [2].

  • skip_transpose (bool) – If False, uses the batch x time x channel convention of SpeechBrain. If True, uses the batch x channel x time convention.
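
Roughly, these parameters combine as in the PCEN formulation of [1] and [2] (a sketch of the computation, not an exact transcription of this implementation): PCEN(t, f) = (E(t, f) / (floor + M(t, f))^alpha + delta)^(1/root) - delta^(1/root), where E denotes the input energy and M its exponential moving average, computed with coefficient smooth_coef by the ExponentialMovingAverage layer above.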

Example

>>> inp_tensor = torch.rand([10, 50, 40])
>>> pcen = PCEN(40, alpha=0.96)         # sPCEN
>>> out_tensor = pcen(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
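
A minimal sketch of the original PCEN variant of [1], obtained (per the parameter description above) by sharing a single smoothing coefficient across channels:

>>> inp_tensor = torch.rand([10, 50, 40])
>>> pcen = PCEN(40, per_channel_smooth_coef=False)  # original PCEN
>>> out_tensor = pcen(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])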
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize.

training: bool