speechbrain.nnet.normalization module

Library implementing normalization.

Authors
  • Mirco Ravanelli 2020

  • Guillermo Cámbara 2021

  • Sarthak Yadav 2022

Summary

Classes:

BatchNorm1d

Applies 1d batch normalization to the input tensor.

BatchNorm2d

Applies 2d batch normalization to the input tensor.

ExponentialMovingAverage

Applies a learnable exponential moving average, as required by the learnable PCEN layer.

GroupNorm

Applies group normalization to the input tensor.

InstanceNorm1d

Applies 1d instance normalization to the input tensor.

InstanceNorm2d

Applies 2d instance normalization to the input tensor.

LayerNorm

Applies layer normalization to the input tensor.

PCEN

This class implements a learnable Per-channel energy normalization (PCEN) layer, supporting both the original PCEN as specified in [1] and sPCEN as specified in [2].

Reference

class speechbrain.nnet.normalization.BatchNorm1d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, combine_batch_time=False, skip_transpose=False)[source]

Bases: Module

Applies 1d batch normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • affine (bool) – When set to True, the affine parameters are learned.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

  • combine_batch_time (bool) – When True, it combines the batch and time axes.

Example

>>> input = torch.randn(100, 10)
>>> norm = BatchNorm1d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10])
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, [channels])) – input to normalize. 2d or 3d tensors are expected; 4d tensors can be used when combine_batch_time=True.
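
As a minimal sketch of the 4d case (an illustration based on the parameter descriptions above, not one of the module's original examples), a 4d input can be normalized by combining the batch and time axes:

>>> input = torch.randn(100, 10, 8, 8)
>>> norm = BatchNorm1d(input_shape=input.shape, combine_batch_time=True)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 8, 8])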

training: bool
class speechbrain.nnet.normalization.BatchNorm2d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)[source]

Bases: Module

Applies 2d batch normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • affine (bool) – When set to True, the affine parameters are learned.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

Example

>>> input = torch.randn(100, 10, 5, 20)
>>> norm = BatchNorm2d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 5, 20])
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channel1, channel2)) – input to normalize. 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.LayerNorm(input_size=None, input_shape=None, eps=1e-05, elementwise_affine=True)[source]

Bases: Module

Applies layer normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • elementwise_affine (bool) – If True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).

Example

>>> input = torch.randn(100, 101, 128)
>>> norm = LayerNorm(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])
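
A minimal sketch using input_size instead of input_shape (assuming input_size sets the normalized last dimension); the output shape is unchanged:

>>> input = torch.randn(100, 101, 128)
>>> norm = LayerNorm(input_size=128)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])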
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.InstanceNorm1d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, track_running_stats=True, affine=False)[source]

Bases: Module

Applies 1d instance normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

  • affine (bool) – When set to True, this module has learnable affine parameters, initialized the same way as for batch normalization. Default: False.

Example

>>> input = torch.randn(100, 10, 20)
>>> norm = InstanceNorm1d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 20])
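
A minimal sketch with learnable affine parameters enabled (affine=True, which defaults to False); the output shape is unchanged:

>>> input = torch.randn(100, 10, 20)
>>> norm = InstanceNorm1d(input_shape=input.shape, affine=True)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 20])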
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize. 3d tensors are expected.

training: bool
class speechbrain.nnet.normalization.InstanceNorm2d(input_shape=None, input_size=None, eps=1e-05, momentum=0.1, track_running_stats=True, affine=False)[source]

Bases: Module

Applies 2d instance normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • momentum (float) – The value used for the running_mean and running_var computation.

  • track_running_stats (bool) – When set to True, this module tracks the running mean and variance, and when set to False, this module does not track such statistics.

  • affine (bool) – When set to True, this module has learnable affine parameters, initialized the same way as for batch normalization. Default: False.

Example

>>> input = torch.randn(100, 10, 20, 2)
>>> norm = InstanceNorm2d(input_shape=input.shape)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 10, 20, 2])
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channel1, channel2)) – input to normalize. 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.GroupNorm(input_shape=None, input_size=None, num_groups=None, eps=1e-05, affine=True)[source]

Bases: Module

Applies group normalization to the input tensor.

Parameters:
  • input_shape (tuple) – The expected shape of the input. Alternatively, use input_size.

  • input_size (int) – The expected size of the input. Alternatively, use input_shape.

  • num_groups (int) – Number of groups to separate the channels into.

  • eps (float) – This value is added to the standard deviation estimate to improve numerical stability.

  • affine (bool) – When set to True, this module has learnable per-channel affine parameters initialized to ones (for weights) and zeros (for biases).

Example

>>> input = torch.randn(100, 101, 128)
>>> norm = GroupNorm(input_size=128, num_groups=128)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])
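
A minimal sketch with fewer groups than channels (num_groups must divide the channel dimension); the output shape is unchanged:

>>> input = torch.randn(100, 101, 128)
>>> norm = GroupNorm(input_size=128, num_groups=4)
>>> output = norm(input)
>>> output.shape
torch.Size([100, 101, 128])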
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize. 3d or 4d tensors are expected.

training: bool
class speechbrain.nnet.normalization.ExponentialMovingAverage(input_size: int, coeff_init: float = 0.04, per_channel: bool = False, trainable: bool = True, skip_transpose: bool = False)[source]

Bases: Module

Applies a learnable exponential moving average, as required by the learnable PCEN layer.

Parameters:
  • input_size (int) – The expected size of the input.

  • coeff_init (float) – Initial smoothing coefficient value

  • per_channel (bool) – Controls whether the smoothing coefficients are learned independently for every input channel.

  • trainable (bool) – Whether to learn the PCEN parameters or keep them fixed.

  • skip_transpose (bool) – If False, uses the batch x time x channel convention of SpeechBrain. If True, uses the batch x channel x time convention.

Example

>>> inp_tensor = torch.rand([10, 50, 40])
>>> pcen = ExponentialMovingAverage(40)
>>> out_tensor = pcen(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
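
A minimal sketch with skip_transpose=True (assuming the input is already laid out as batch x channel x time); the output shape matches the input:

>>> inp_tensor = torch.rand([10, 40, 50])
>>> ema = ExponentialMovingAverage(40, skip_transpose=True)
>>> out_tensor = ema(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 40, 50])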
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize.

training: bool
class speechbrain.nnet.normalization.PCEN(input_size, alpha: float = 0.96, smooth_coef: float = 0.04, delta: float = 2.0, root: float = 2.0, floor: float = 1e-12, trainable: bool = True, per_channel_smooth_coef: bool = True, skip_transpose: bool = False)[source]

Bases: Module

This class implements a learnable Per-channel energy normalization (PCEN) layer, supporting both the original PCEN as specified in [1] and sPCEN as specified in [2].

[1] Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F. Lyon, Rif A. Saurous, “Trainable Frontend For Robust and Far-Field Keyword Spotting”, in Proc of ICASSP 2017 (https://arxiv.org/abs/1607.05666)

[2] Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, “LEAF: A Learnable Frontend for Audio Classification”, in Proc of ICLR 2021 (https://arxiv.org/abs/2101.08596)

The default argument values correspond with those used by [2].

Parameters:
  • input_size (int) – The expected size of the input.

  • alpha (float) – Specifies the alpha coefficient for PCEN.

  • smooth_coef (float) – Specifies the smoothing coefficient for PCEN.

  • delta (float) – Specifies the delta coefficient for PCEN.

  • root (float) – Specifies the root coefficient for PCEN.

  • floor (float) – Specifies the floor coefficient for PCEN.

  • trainable (bool) – Whether to learn the PCEN parameters or keep them fixed.

  • per_channel_smooth_coef (bool) – Whether to learn an independent smoothing coefficient for every channel. When True, this essentially corresponds to sPCEN from [2].

  • skip_transpose (bool) – If False, uses the batch x time x channel convention of SpeechBrain. If True, uses the batch x channel x time convention.
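
Roughly, these parameters combine as in the PCEN formulation of [1] and [2] (a sketch of the computation, not an exact transcription of this implementation): PCEN(t, f) = (E(t, f) / (floor + M(t, f))^alpha + delta)^(1/root) - delta^(1/root), where E denotes the input energy and M its exponential moving average, computed with coefficient smooth_coef by the ExponentialMovingAverage layer above.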

Example

>>> inp_tensor = torch.rand([10, 50, 40])
>>> pcen = PCEN(40, alpha=0.96)         # sPCEN
>>> out_tensor = pcen(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])
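
A minimal sketch of the original PCEN variant of [1], obtained (per the parameter description above) by sharing a single smoothing coefficient across channels:

>>> inp_tensor = torch.rand([10, 50, 40])
>>> pcen = PCEN(40, per_channel_smooth_coef=False)  # original PCEN
>>> out_tensor = pcen(inp_tensor)
>>> out_tensor.shape
torch.Size([10, 50, 40])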
forward(x)[source]

Returns the normalized input tensor.

Parameters:

x (torch.Tensor (batch, time, channels)) – input to normalize.

training: bool