speechbrain.lobes.models.dual_path module

Library to support dual-path speech separation.

Authors

Cem Subakan 2020
Mirco Ravanelli 2020
Samuele Cornell 2020
Mirko Bronzi 2020
Jianyuan Zhong 2020

Summary

Classes:

`CumulativeLayerNorm`	Calculate Cumulative Layer Normalization.
`DPTNetBlock`	The DPT Net block.
`Decoder`	A decoder layer that consists of ConvTranspose1d.
`Dual_Computation_Block`	Computation block for dual-path processing.
`Dual_Path_Model`	The dual path model which is the basis for dualpathrnn, sepformer, dptnet.
`Encoder`	Convolutional Encoder Layer.
`FastTransformerBlock`	This block is used to implement fast transformer models with efficient attention.
`GlobalLayerNorm`	Calculate Global Layer Normalization.
`IdentityBlock`	This block is used when we want to have identity transformation within the Dual_path block.
`PyTorchPositionalEncoding`	Positional encoder for the pytorch transformer.
`PytorchTransformerBlock`	A wrapper that uses the pytorch transformer block.
`SBConformerEncoderBlock`	A wrapper for the SpeechBrain implementation of the ConformerEncoder.
`SBRNNBlock`	RNNBlock for the dual path pipeline.
`SBTransformerBlock`	A wrapper for the SpeechBrain implementation of the transformer encoder.
`SepformerWrapper`	The wrapper for the sepformer model which combines the Encoder, Masknet and the decoder https://arxiv.org/abs/2010.13154

Functions:

`select_norm`	A wrapper that selects the normalization type.

Reference

class speechbrain.lobes.models.dual_path.GlobalLayerNorm(dim, shape, eps=1e-08, elementwise_affine=True)[source]

Bases: Module

Calculate Global Layer Normalization.

Parameters:

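dim (int or list or torch.Size) – Size of the channel dimension C of the expected input.
shape (int) – Number of dimensions of the input tensor: 3 for [N, C, L], 4 for [N, C, K, S].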
eps (float) – A value added to the denominator for numerical stability.
elementwise_affine (bool) – A boolean value that when set to True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).

Example

>>> x = torch.randn(5, 10, 20)
>>> GLN = GlobalLayerNorm(10, 3)
>>> x_norm = GLN(x)

forward(x)[source]

Returns the normalized tensor.

Parameters: x (torch.Tensor) – Tensor of size [N, C, K, S] or [N, C, L].

training: bool
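
As a rough illustration of what global layer normalization computes, the sketch below normalizes each batch item by the mean and variance taken over all of its non-batch dimensions. It is a minimal sketch and not the library's implementation: the helper name `global_layer_norm_sketch` is hypothetical, and the learnable affine parameters are left out.

```python
import torch


def global_layer_norm_sketch(x, eps=1e-8):
    """Normalize each batch item over all non-batch dimensions (no affine terms)."""
    # Reduce over every dimension except the batch dimension 0.
    dims = tuple(range(1, x.dim()))
    mean = x.mean(dim=dims, keepdim=True)
    var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)


x = torch.randn(5, 10, 20)  # [N, C, L]
x_norm = global_layer_norm_sketch(x)
```

The same function accepts a 4-D [N, C, K, S] tensor, because the reduction dimensions are derived from the rank of the input.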

class speechbrain.lobes.models.dual_path.CumulativeLayerNorm(dim, elementwise_affine=True, eps=1e-08)[source]

Bases: LayerNorm

Calculate Cumulative Layer Normalization.

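Parameters:

dim (int) – Dimension that you want to normalize.
elementwise_affine (bool) – If True, the module has learnable per-element affine parameters.
eps (float) – A value added to the denominator for numerical stability.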

Example

>>> x = torch.randn(5, 10, 20)
>>> CLN = CumulativeLayerNorm(10)
>>> x_norm = CLN(x)

forward(x)[source]

Returns the normalized tensor.

Parameters: x (torch.Tensor) – Tensor of size [N, C, K, S] or [N, C, L].

normalized_shape: Tuple[int, ...]

eps: float

elementwise_affine: bool
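
Because the class derives from LayerNorm and `dim` equals the channel size in the example above, one plausible reading is that the statistics are computed over the channel axis at every time step. The sketch below reproduces that behaviour with plain `torch.nn.LayerNorm`; treating the channel axis as the normalized axis is an assumption inferred from this page, not a statement about the library's internals.

```python
import torch

x = torch.randn(5, 10, 20)  # [N, C, L]
layer_norm = torch.nn.LayerNorm(10, eps=1e-8)

# Move channels last, normalize over them, then restore the [N, C, L] layout.
x_norm = layer_norm(x.transpose(1, 2)).transpose(1, 2)
```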

speechbrain.lobes.models.dual_path.select_norm(norm, dim, shape, eps=1e-08)[source]

A wrapper that selects the normalization type.

class speechbrain.lobes.models.dual_path.Encoder(kernel_size=2, out_channels=64, in_channels=1)[source]

Bases: Module

Convolutional Encoder Layer.

Parameters:

kernel_size (int) – Length of filters.
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.

Example

>>> x = torch.randn(2, 1000)
>>> encoder = Encoder(kernel_size=4, out_channels=64)
>>> h = encoder(x)
>>> h.shape
torch.Size([2, 64, 499])

forward(x)[source]

Return the encoded output.

Parameters:

x (torch.Tensor) – Input tensor with dimensionality [B, L].

Returns:

x (torch.Tensor) – Encoded tensor with dimensionality [B, N, T_out], where B = batch size, L = number of time points, N = number of filters, T_out = number of time points at the output of the encoder.

training: bool
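
The output length T_out in the example above (1000 samples, kernel size 4, 499 frames) is consistent with an unpadded 1-D convolution whose stride is half the kernel size. The helper below is a hypothetical sketch of that arithmetic; the default stride of `kernel_size // 2` is an assumption inferred from the example, not something this page states.

```python
def encoder_output_length(num_samples, kernel_size, stride=None):
    """Number of frames produced by an unpadded 1-D convolution."""
    if stride is None:
        # Assumption: stride is half the kernel size, as the documented
        # example (1000 samples -> 499 frames for kernel_size=4) suggests.
        stride = kernel_size // 2
    return (num_samples - kernel_size) // stride + 1


t_out = encoder_output_length(1000, kernel_size=4)
```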

class speechbrain.lobes.models.dual_path.Decoder(*args, **kwargs)[source]

Bases: ConvTranspose1d

A decoder layer that consists of ConvTranspose1d.

Parameters:

kernel_size (int) – Length of filters.
in_channels (int) – Number of input channels.
out_channels (int) – Number of output channels.

Example

>>> x = torch.randn(2, 100, 1000)
>>> decoder = Decoder(kernel_size=4, in_channels=100, out_channels=1)
>>> h = decoder(x)
>>> h.shape
torch.Size([2, 1003])

forward(x)[source]

Return the decoded output.

Parameters:

x (torch.Tensor) – Input tensor with dimensionality [B, N, L], where B = batch size, N = number of filters, L = time points.

bias: Tensor | None

in_channels: int

out_channels: int

kernel_size: Tuple[int, ...]

stride: Tuple[int, ...]

padding: str | Tuple[int, ...]

dilation: Tuple[int, ...]

transposed: bool

output_padding: Tuple[int, ...]

groups: int

padding_mode: str

weight: Tensor
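
The output length in the Decoder example (1000 frames, kernel size 4, 1003 samples) follows from the standard transposed-convolution length formula with no padding and the ConvTranspose1d default stride of 1. The helper below is a hypothetical sketch of that arithmetic, not part of the module.

```python
def decoder_output_length(num_frames, kernel_size, stride=1):
    """Output length of an unpadded 1-D transposed convolution (dilation 1)."""
    return (num_frames - 1) * stride + kernel_size


length = decoder_output_length(1000, kernel_size=4)
```

If the decoder is constructed with a stride of 2 for the same kernel size, 499 frames map back to 1000 samples, since (499 - 1) * 2 + 4 = 1000.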

class speechbrain.lobes.models.dual_path.IdentityBlock[source]

Bases: object

This block is used when we want to have identity transformation within the Dual_path block.

Example

>>> x = torch.randn(10, 100)
>>> IB = IdentityBlock()
>>> xhat = IB(x)

class speechbrain.lobes.models.dual_path.FastTransformerBlock(attention_type, out_channels, num_layers=6, nhead=8, d_ffn=1024, dropout=0, activation='relu', reformer_bucket_size=32)[source]

Bases: Module

This block is used to implement fast transformer models with efficient attention.

The implementations are taken from https://fast-transformers.github.io/

Parameters:

attention_type (str) – Specifies the type of attention. Check https://fast-transformers.github.io/ for details.
out_channels (int) – Dimensionality of the representation.
num_layers (int) – Number of layers.
nhead (int) – Number of attention heads.
d_ffn (int) – Dimensionality of positional feed-forward.
dropout (float) – Dropout rate.
activation (str) – Activation function.
reformer_bucket_size (int) – Bucket size for reformer.

Example

# >>> x = torch.randn(10, 100, 64)
# >>> block = FastTransformerBlock('linear', 64)
# >>> x = block(x)
# >>> x.shape
# torch.Size([10, 100, 64])

forward(x)[source]

Returns the transformed input.

Parameters:

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool

class speechbrain.lobes.models.dual_path.PyTorchPositionalEncoding(d_model, dropout=0.1, max_len=5000)[source]

Bases: Module

Positional encoder for the pytorch transformer.

Parameters:

d_model (int) – Representation dimensionality.
dropout (float) – Dropout probability.
max_len (int) – Max sequence length.

Example

>>> x = torch.randn(10, 100, 64)
>>> enc = PyTorchPositionalEncoding(64)
>>> x = enc(x)

forward(x)[source]

Returns the encoded output.

Parameters:

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
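
This page does not spell out the encoding itself. Assuming the standard sinusoidal scheme commonly used with the PyTorch transformer, the table added to the input could be built as in the sketch below. The function name `sinusoidal_encoding` is hypothetical, `d_model` is assumed to be even, and the dropout step is omitted.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position codes."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # Even indices hold sines, the following odd indices hold cosines.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_encoding(max_len=100, d_model=64)
```

Each row of `pe` corresponds to one time point and would be added to the matching [L, N] slice of every batch item.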

class speechbrain.lobes.models.dual_path.PytorchTransformerBlock(out_channels, num_layers=6, nhead=8, d_ffn=2048, dropout=0.1, activation='relu', use_positional_encoding=True)[source]

Bases: Module

A wrapper that uses the pytorch transformer block.

Parameters:

out_channels (int) – Dimensionality of the representation.
num_layers (int) – Number of layers.
nhead (int) – Number of attention heads.
d_ffn (int) – Dimensionality of positional feed forward.
dropout (float) – Dropout rate.
activation (str) – Activation function.
use_positional_encoding (bool) – If true we use a positional encoding.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = PytorchTransformerBlock(64)
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])

forward(x)[source]

Returns the transformed output.

Parameters:

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool

class speechbrain.lobes.models.dual_path.SBTransformerBlock(num_layers, d_model, nhead, d_ffn=2048, input_shape=None, kdim=None, vdim=None, dropout=0.1, activation='relu', use_positional_encoding=False, norm_before=False, attention_type='regularMHA')[source]

Bases: Module

A wrapper for the SpeechBrain implementation of the transformer encoder.

Parameters:

num_layers (int) – Number of layers.
d_model (int) – Dimensionality of the representation.
nhead (int) – Number of attention heads.
d_ffn (int) – Dimensionality of positional feed forward.
input_shape (tuple) – Shape of input.
kdim (int) – Dimension of the key (Optional).
vdim (int) – Dimension of the value (Optional).
dropout (float) – Dropout rate.
activation (str) – Activation function.
use_positional_encoding (bool) – If true we use a positional encoding.
norm_before (bool) – Use normalization before transformations.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = SBTransformerBlock(1, 64, 8)
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])

forward(x)[source]

Returns the transformed output.

Parameters:

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool

class speechbrain.lobes.models.dual_path.SBRNNBlock(input_size, hidden_channels, num_layers, rnn_type='LSTM', dropout=0, bidirectional=True)[source]

Bases: Module

RNNBlock for the dual path pipeline.

Parameters:

input_size (int) – Dimensionality of the input features.
hidden_channels (int) – Dimensionality of the latent layer of the rnn.
num_layers (int) – Number of the rnn layers.
rnn_type (str) – Type of the rnn cell.
dropout (float) – Dropout rate.
bidirectional (bool) – If True, bidirectional.

Example

>>> x = torch.randn(10, 100, 64)
>>> rnn = SBRNNBlock(64, 100, 1, bidirectional=True)
>>> x = rnn(x)
>>> x.shape
torch.Size([10, 100, 200])

forward(x)[source]

Returns the transformed output.

Parameters:

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
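
The last dimension of 200 in the example above comes from the bidirectional setting: a bidirectional RNN concatenates the forward and backward hidden states, so the feature size is twice `hidden_channels`. The sketch below shows the same effect with a plain `torch.nn.LSTM`; it illustrates the shape arithmetic only and is not how SBRNNBlock is necessarily built.

```python
import torch

x = torch.randn(10, 100, 64)  # [B, L, N]
lstm = torch.nn.LSTM(
    input_size=64,
    hidden_size=100,
    num_layers=1,
    batch_first=True,
    bidirectional=True,
)

# The output concatenates both directions: 2 * hidden_size features.
out, _ = lstm(x)
```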

class speechbrain.lobes.models.dual_path.DPTNetBlock(d_model, nhead, dim_feedforward=256, dropout=0, activation='relu')[source]

Bases: Module

The DPT Net block.

Parameters:

d_model (int) – Number of expected features in the input (required).
nhead (int) – Number of heads in the multi-head attention model (required).
dim_feedforward (int) – Dimension of the feedforward network model (default=256).
dropout (float) – Dropout value (default=0).
activation (str) – Activation function of the intermediate layer, relu or gelu (default=relu).

Examples

>>> encoder_layer = DPTNetBlock(d_model=512, nhead=8)
>>> src = torch.rand(10, 100, 512)
>>> out = encoder_layer(src)
>>> out.shape
torch.Size([10, 100, 512])

forward(src)[source]

Pass the input through the encoder layer.

Parameters:

src (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool

class speechbrain.lobes.models.dual_path.Dual_Computation_Block(intra_mdl, inter_mdl, out_channels, norm='ln', skip_around_intra=True, linear_layer_after_inter_intra=True)[source]

Bases: Module

Computation block for dual-path processing.

Parameters:

intra_mdl (torch.nn.module) – Model to process within the chunks.
inter_mdl (torch.nn.module) – Model to process across the chunks.
out_channels (int) – Dimensionality of inter/intra model.
norm (str) – Normalization type.
skip_around_intra (bool) – Skip connection around the intra layer.
linear_layer_after_inter_intra (bool) – Whether to use a linear layer after inter or intra.

Example

>>> intra_block = SBTransformerBlock(1, 64, 8)
>>> inter_block = SBTransformerBlock(1, 64, 8)
>>> dual_comp_block = Dual_Computation_Block(intra_block, inter_block, 64)
>>> x = torch.randn(10, 64, 100, 10)
>>> x = dual_comp_block(x)
>>> x.shape
torch.Size([10, 64, 100, 10])

forward(x)[source]

Returns the output tensor.

Parameters:

x (torch.Tensor) – Input tensor of dimension [B, N, K, S].

Returns:

out – Output tensor of dimension [B, N, K, S], where B = batch size, N = number of filters, K = time points in each chunk, S = the number of chunks.

Return type:

torch.Tensor

training: bool
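
To make "within the chunks" and "across the chunks" concrete, the sketch below shows one way a [B, N, K, S] tensor can be rearranged so that a sequence model taking [batch, time, features] input runs along K (intra) or along S (inter). The exact permutations are an assumption for illustration and may differ from what the block does internally.

```python
import torch

B, N, K, S = 2, 64, 100, 10
x = torch.randn(B, N, K, S)

# Intra view: each of the B*S chunks becomes a sequence of K time points.
intra_in = x.permute(0, 3, 2, 1).reshape(B * S, K, N)

# Inter view: each of the B*K in-chunk positions becomes a sequence of S chunks.
inter_in = x.permute(0, 2, 3, 1).reshape(B * K, S, N)

# Undo the intra rearrangement to recover the [B, N, K, S] layout.
restored = intra_in.reshape(B, S, K, N).permute(0, 3, 2, 1)
```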

class speechbrain.lobes.models.dual_path.Dual_Path_Model(in_channels, out_channels, intra_model, inter_model, num_layers=1, norm='ln', K=200, num_spks=2, skip_around_intra=True, linear_layer_after_inter_intra=True, use_global_pos_enc=False, max_length=20000)[source]

Bases: Module

The dual path model which is the basis for dualpathrnn, sepformer, dptnet.

Parameters:

in_channels (int) – Number of channels at the output of the encoder.
out_channels (int) – Number of channels fed to the intra and inter blocks.
intra_model (torch.nn.module) – Model to process within the chunks.
inter_model (torch.nn.module) – Model to process across the chunks.
num_layers (int) – Number of layers of Dual Computation Block.
norm (str) – Normalization type.
K (int) – Chunk length.
num_spks (int) – Number of sources (speakers).
skip_around_intra (bool) – Skip connection around intra.
linear_layer_after_inter_intra (bool) – Linear layer after inter and intra.
use_global_pos_enc (bool) – Global positional encodings.
max_length (int) – Maximum sequence length.

Example

>>> intra_block = SBTransformerBlock(1, 64, 8)
>>> inter_block = SBTransformerBlock(1, 64, 8)
>>> dual_path_model = Dual_Path_Model(64, 64, intra_block, inter_block, num_spks=2)
>>> x = torch.randn(10, 64, 2000)
>>> x = dual_path_model(x)
>>> x.shape
torch.Size([2, 10, 64, 2000])

forward(x)[source]

Returns the output tensor.

Parameters:

x (torch.Tensor) – Input tensor of dimension [B, N, L].

Returns:

out – Output tensor of dimension [spks, B, N, L], where spks = number of speakers, B = batch size, N = number of filters, L = the number of time points.

Return type:

torch.Tensor

training: bool
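
The chunk length K implies that the sequence is split into chunks before the intra and inter models run. The sketch below shows one common way to do this, with 50% overlap between neighbouring chunks as in dual-path RNN. The overlap, the zero-padding rule, and the helper name `chunk_with_overlap` are assumptions for illustration and are not taken from this page.

```python
def chunk_with_overlap(signal, K):
    """Split a 1-D sequence into chunks of length K with hop K // 2."""
    hop = K // 2
    n = len(signal)
    # Zero-pad the tail so the last chunk is complete.
    pad = K - n if n < K else (-(n - K)) % hop
    padded = list(signal) + [0] * pad
    return [padded[i:i + K] for i in range(0, len(padded) - K + 1, hop)]


chunks = chunk_with_overlap(list(range(10)), K=4)
```

Here `len(chunks)` plays the role of S and each chunk has K entries; consecutive chunks share K // 2 values.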

class speechbrain.lobes.models.dual_path.SepformerWrapper(encoder_kernel_size=16, encoder_in_nchannels=1, encoder_out_nchannels=256, masknet_chunksize=250, masknet_numlayers=2, masknet_norm='ln', masknet_useextralinearlayer=False, masknet_extraskipconnection=True, masknet_numspks=2, intra_numlayers=8, inter_numlayers=8, intra_nhead=8, inter_nhead=8, intra_dffn=1024, inter_dffn=1024, intra_use_positional=True, inter_use_positional=True, intra_norm_before=True, inter_norm_before=True)[source]

Bases: Module

The wrapper for the sepformer model, which combines the Encoder, Masknet and the Decoder (https://arxiv.org/abs/2010.13154).

Parameters:

encoder_kernel_size (int) – The kernel size used in the encoder.
encoder_in_nchannels (int) – The number of channels of the input audio.
encoder_out_nchannels (int) – The number of filters used in the encoder. Also the number of channels fed to the intra and inter blocks.
masknet_chunksize (int) – The chunk length that is to be processed by the intra blocks.
masknet_numlayers (int) – The number of layers of combination of inter and intra blocks.
masknet_norm (str) – The normalization type to be used in the masknet. Should be one of 'ln' (layernorm), 'gln' (globallayernorm), 'cln' (cumulative layernorm), 'bn' (batchnorm); see the select_norm function above for more details.
masknet_useextralinearlayer (bool) – Whether or not to use a linear layer at the output of intra and inter blocks.
masknet_extraskipconnection (bool) – Whether or not to add extra skip connections around the intra block.
masknet_numspks (int) – The number of speakers to estimate.
intra_numlayers (int) – The number of layers in the intra block.
inter_numlayers (int) – The number of layers in the inter block.
intra_nhead (int) – The number of parallel attention heads in the intra block.
inter_nhead (int) – The number of parallel attention heads in the inter block.
intra_dffn (int) – The number of dimensions in the positional feedforward model in the intra block.
inter_dffn (int) – The number of dimensions in the positional feedforward model in the inter block.
intra_use_positional (bool) – Whether or not to use positional encodings in the intra block.
inter_use_positional (bool) – Whether or not to use positional encodings in the inter block.
intra_norm_before (bool) – Whether or not we use normalization before the transformations in the intra block.
inter_norm_before (bool) – Whether or not we use normalization before the transformations in the inter block.

Example

>>> model = SepformerWrapper()
>>> inp = torch.rand(1, 160)
>>> result = model.forward(inp)
>>> result.shape
torch.Size([1, 160, 2])

reset_layer_recursively(layer)[source]

Reinitializes the parameters of the network.

forward(mix)[source]

Processes the input tensor mix and returns an output tensor.

training: bool

class speechbrain.lobes.models.dual_path.SBConformerEncoderBlock(num_layers, d_model, nhead, d_ffn=2048, input_shape=None, kdim=None, vdim=None, dropout=0.1, activation='swish', kernel_size=31, bias=True, use_positional_encoding=True, attention_type='RelPosMHAXL')[source]

Bases: Module

A wrapper for the SpeechBrain implementation of the ConformerEncoder.

Parameters:

num_layers (int) – Number of layers.
d_model (int) – Dimensionality of the representation.
nhead (int) – Number of attention heads.
d_ffn (int) – Dimensionality of positional feed forward.
input_shape (tuple) – Shape of input.
kdim (int) – Dimension of the key (Optional).
vdim (int) – Dimension of the value (Optional).
dropout (float) – Dropout rate.
activation (str) – Activation function.
kernel_size (int) – Kernel size in the conformer encoder.
bias (bool) – Whether to use bias in the convolution part of the conformer encoder.
use_positional_encoding (bool) – If true we use a positional encoding.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = SBConformerEncoderBlock(1, 64, 8)
>>> from speechbrain.lobes.models.transformer.Transformer import PositionalEncoding
>>> pos_enc = PositionalEncoding(64)
>>> pos_embs = pos_enc(torch.ones(1, 199, 64))
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])

training: bool

forward(x)[source]

Returns the transformed output.

Parameters:

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.