speechbrain.lobes.models.dual_path module

Library to support dual-path speech separation.

Authors
  • Cem Subakan 2020

  • Mirco Ravanelli 2020

  • Samuele Cornell 2020

  • Mirko Bronzi 2020

  • Jianyuan Zhong 2020

Summary

Classes:

CumulativeLayerNorm

Calculate Cumulative Layer Normalization.

DPTNetBlock

The DPT Net block.

Decoder

A decoder layer that consists of ConvTranspose1d.

Dual_Computation_Block

Computation block for dual-path processing.

Dual_Path_Model

The dual path model which is the basis for dualpathrnn, sepformer, dptnet.

Encoder

Convolutional Encoder Layer.

FastTransformerBlock

This block is used to implement fast transformer models with efficient attention.

GlobalLayerNorm

Calculate Global Layer Normalization.

IdentityBlock

This block is used when we want to have identity transformation within the Dual_path block.

PyTorchPositionalEncoding

Positional encoder for the pytorch transformer.

PytorchTransformerBlock

A wrapper that uses the pytorch transformer block.

SBConformerEncoderBlock

A wrapper for the SpeechBrain implementation of the ConformerEncoder.

SBRNNBlock

RNNBlock for the dual path pipeline.

SBTransformerBlock

A wrapper for the SpeechBrain implementation of the transformer encoder.

SepformerWrapper

The wrapper for the sepformer model, which combines the encoder, masknet, and decoder. See https://arxiv.org/abs/2010.13154

Functions:

select_norm

Just a wrapper to select the normalization type.

Reference

class speechbrain.lobes.models.dual_path.GlobalLayerNorm(dim, shape, eps=1e-08, elementwise_affine=True)[source]

Bases: torch.nn.modules.module.Module

Calculate Global Layer Normalization.

Parameters
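  • dim (int or list or torch.Size) – Size of the channel dimension (C) of the expected input.

  • shape (int) – Number of dimensions of the input tensor: 3 for [N, C, L], 4 for [N, C, K, S].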

  • eps (float) – A value added to the denominator for numerical stability.

  • elementwise_affine (bool) – If True, this module has learnable per-element affine parameters initialized to ones (for weights) and zeros (for biases).

Example

>>> x = torch.randn(5, 10, 20)
>>> GLN = GlobalLayerNorm(10, 3)
>>> x_norm = GLN(x)
forward(x)[source]

Returns the normalized tensor.

Parameters

x (torch.Tensor) – Tensor of size [N, C, K, S] or [N, C, L].

training: bool
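
A minimal sketch of what global layer normalization computes, assuming the usual definition in which the mean and variance are taken over every non-batch dimension of each sample. The helper name global_layer_norm is hypothetical and the learnable affine parameters are left out, so this illustrates the idea rather than the module's actual code.

```python
import torch

def global_layer_norm(x, eps=1e-8):
    """Normalize each sample over all of its non-batch dimensions.

    Hypothetical helper for illustration; no learnable weight or bias.
    """
    # (1, 2) for [N, C, L] inputs, (1, 2, 3) for [N, C, K, S] inputs.
    dims = tuple(range(1, x.dim()))
    mean = x.mean(dim=dims, keepdim=True)
    var = ((x - mean) ** 2).mean(dim=dims, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(5, 10, 20)
x_norm = global_layer_norm(x)
```

After this transform every sample in the batch has approximately zero mean and unit variance across its channel and time axes.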
class speechbrain.lobes.models.dual_path.CumulativeLayerNorm(dim, elementwise_affine=True)[source]

Bases: torch.nn.modules.normalization.LayerNorm

Calculate Cumulative Layer Normalization.

Parameters
  • dim (int) – Dimension that you want to normalize.

  • elementwise_affine (bool) – Learnable per-element affine parameters.

Example

>>> x = torch.randn(5, 10, 20)
>>> CLN = CumulativeLayerNorm(10)
>>> x_norm = CLN(x)
forward(x)[source]

Returns the normalized tensor.

Parameters

x (torch.Tensor) – Tensor of size [N, C, K, S] or [N, C, L].

normalized_shape: Tuple[int, ]
eps: float
elementwise_affine: bool
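
The class derives from torch.nn.LayerNorm and takes the channel count as dim, which suggests that statistics are computed over the channel axis at every time step. The sketch below works under that assumption for a [N, C, L] input: LayerNorm acts on the last axis, so the channel axis is moved there and back.

```python
import torch
import torch.nn as nn

x = torch.randn(5, 10, 20)      # [N, C, L]
layer_norm = nn.LayerNorm(10)   # normalizes the last axis of its input

# Move C to the last axis, normalize, then restore the [N, C, L] layout.
x_norm = layer_norm(x.transpose(1, 2)).transpose(1, 2)
```

Each time step of each sample then has approximately zero mean across its 10 channels.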
speechbrain.lobes.models.dual_path.select_norm(norm, dim, shape)[source]

Just a wrapper to select the normalization type.

class speechbrain.lobes.models.dual_path.Encoder(kernel_size=2, out_channels=64, in_channels=1)[source]

Bases: torch.nn.modules.module.Module

Convolutional Encoder Layer.

Parameters
  • kernel_size (int) – Length of filters.

  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

Example

>>> x = torch.randn(2, 1000)
>>> encoder = Encoder(kernel_size=4, out_channels=64)
>>> h = encoder(x)
>>> h.shape
torch.Size([2, 64, 499])
forward(x)[source]

Return the encoded output.

Parameters

x (torch.Tensor) – Input tensor with dimensionality [B, L].

Returns

  • x (torch.Tensor) – Encoded tensor with dimensionality [B, N, T_out], where B = batch size, L = number of time points at the input, N = number of filters, T_out = number of time points at the output of the encoder.

training: bool
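
The output length in the example (1000 samples in, 499 frames out with kernel_size=4) is what an unpadded 1-D convolution gives when its stride is half the kernel size. The stride value is an assumption inferred from that example, and encoder_out_len is a hypothetical helper, not part of the module.

```python
def encoder_out_len(num_samples, kernel_size, stride=None):
    """Number of frames produced by an unpadded 1-D convolution."""
    if stride is None:
        # Assumption: consecutive filters overlap by half their length.
        stride = kernel_size // 2
    return (num_samples - kernel_size) // stride + 1

print(encoder_out_len(1000, kernel_size=4))  # 499, as in the example above
```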
class speechbrain.lobes.models.dual_path.Decoder(*args, **kwargs)[source]

Bases: torch.nn.modules.conv.ConvTranspose1d

A decoder layer that consists of ConvTranspose1d.

Parameters
  • kernel_size (int) – Length of filters.

  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

Example

>>> x = torch.randn(2, 100, 1000)
>>> decoder = Decoder(kernel_size=4, in_channels=100, out_channels=1)
>>> h = decoder(x)
>>> h.shape
torch.Size([2, 1003])
forward(x)[source]

Return the decoded output.

Parameters

x (torch.Tensor) – Input tensor with dimensionality [B, N, L], where B = batch size, N = number of filters, L = time points.

bias: Optional[torch.Tensor]
out_channels: int
kernel_size: Tuple[int, ]
stride: Tuple[int, ]
padding: Tuple[int, ]
dilation: Tuple[int, ]
transposed: bool
output_padding: Tuple[int, ]
groups: int
padding_mode: str
weight: torch.Tensor
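
Since the decoder is a ConvTranspose1d, its output length follows the formula given in the PyTorch documentation for that layer. The helper below (a hypothetical name) evaluates the formula and reproduces the 1003 samples of the example, where the stride is left at its default of 1.

```python
def conv_transpose1d_out_len(in_len, kernel_size, stride=1, padding=0,
                             output_padding=0, dilation=1):
    """Output length of torch.nn.ConvTranspose1d for a given input length."""
    return ((in_len - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

print(conv_transpose1d_out_len(1000, kernel_size=4))           # 1003
print(conv_transpose1d_out_len(19, kernel_size=16, stride=8))  # 160
```

The second call shows that 19 frames decoded with kernel_size=16 and stride=8 give back 160 samples, which is the length an unpadded convolution with the same kernel and stride would reduce to 19 frames.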
class speechbrain.lobes.models.dual_path.IdentityBlock[source]

Bases: object

This block is used when we want to have identity transformation within the Dual_path block.

Example

>>> x = torch.randn(10, 100)
>>> IB = IdentityBlock()
>>> xhat = IB(x)
class speechbrain.lobes.models.dual_path.FastTransformerBlock(attention_type, out_channels, num_layers=6, nhead=8, d_ffn=1024, dropout=0, activation='relu', reformer_bucket_size=32)[source]

Bases: torch.nn.modules.module.Module

This block is used to implement fast transformer models with efficient attention.

The implementations are taken from https://fast-transformers.github.io/

Parameters
  • attention_type (str) – Specifies the type of attention. Check https://fast-transformers.github.io/ for details.

  • out_channels (int) – Dimensionality of the representation.

  • num_layers (int) – Number of layers.

  • nhead (int) – Number of attention heads.

  • d_ffn (int) – Dimensionality of the position-wise feed-forward layer.

  • dropout (float) – Dropout rate.

  • activation (str) – Activation function.

  • reformer_bucket_size (int) – Bucket size for reformer.

Example

# >>> x = torch.randn(10, 100, 64)
# >>> block = FastTransformerBlock('linear', 64)
# >>> x = block(x)
# >>> x.shape
# torch.Size([10, 100, 64])

forward(x)[source]

Returns the transformed input.

Parameters

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
class speechbrain.lobes.models.dual_path.PyTorchPositionalEncoding(d_model, dropout=0.1, max_len=5000)[source]

Bases: torch.nn.modules.module.Module

Positional encoder for the pytorch transformer.

Parameters
  • d_model (int) – Representation dimensionality.

  • dropout (float) – Dropout probability.

  • max_len (int) – Max sequence length.

Example

>>> x = torch.randn(10, 100, 64)
>>> enc = PyTorchPositionalEncoding(64)
>>> x = enc(x)
forward(x)[source]

Returns the encoded output.

Parameters

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
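
The text above does not say which encoding scheme is used. A common choice for transformer models is the fixed sinusoidal table, and the sketch below assumes that scheme, an even d_model, and a [B, L, N] input. The function name sinusoidal_table is hypothetical and dropout is omitted.

```python
import math
import torch

def sinusoidal_table(max_len, d_model):
    """Build a [max_len, d_model] table of sine/cosine position codes."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)  # odd feature indices
    return pe

x = torch.randn(10, 100, 64)   # [B, L, N]
pe = sinusoidal_table(5000, 64)
y = x + pe[: x.shape[1]]       # the table is broadcast over the batch axis
```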
class speechbrain.lobes.models.dual_path.PytorchTransformerBlock(out_channels, num_layers=6, nhead=8, d_ffn=2048, dropout=0.1, activation='relu', use_positional_encoding=True)[source]

Bases: torch.nn.modules.module.Module

A wrapper that uses the pytorch transformer block.

Parameters
  • out_channels (int) – Dimensionality of the representation.

  • num_layers (int) – Number of layers.

  • nhead (int) – Number of attention heads.

  • d_ffn (int) – Dimensionality of the position-wise feed-forward layer.

  • dropout (float) – Dropout rate.

  • activation (str) – Activation function.

  • use_positional_encoding (bool) – If true we use a positional encoding.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = PytorchTransformerBlock(64)
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])
forward(x)[source]

Returns the transformed output.

Parameters

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
class speechbrain.lobes.models.dual_path.SBTransformerBlock(num_layers, d_model, nhead, d_ffn=2048, input_shape=None, kdim=None, vdim=None, dropout=0.1, activation='relu', use_positional_encoding=False, norm_before=False)[source]

Bases: torch.nn.modules.module.Module

A wrapper for the SpeechBrain implementation of the transformer encoder.

Parameters
  • num_layers (int) – Number of layers.

  • d_model (int) – Dimensionality of the representation.

  • nhead (int) – Number of attention heads.

  • d_ffn (int) – Dimensionality of the position-wise feed-forward layer.

  • input_shape (tuple) – Shape of input.

  • kdim (int) – Dimension of the key (Optional).

  • vdim (int) – Dimension of the value (Optional).

  • dropout (float) – Dropout rate.

  • activation (str) – Activation function.

  • use_positional_encoding (bool) – If true we use a positional encoding.

  • norm_before (bool) – Use normalization before transformations.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = SBTransformerBlock(1, 64, 8)
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])
forward(x)[source]

Returns the transformed output.

Parameters

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
class speechbrain.lobes.models.dual_path.SBConformerEncoderBlock(num_layers, d_model, nhead, d_ffn=2048, input_shape=None, kdim=None, vdim=None, dropout=0.1, activation='swish', kernel_size=31, bias=True, use_positional_encoding=False)[source]

Bases: torch.nn.modules.module.Module

A wrapper for the SpeechBrain implementation of the ConformerEncoder.

Parameters
  • num_layers (int) – Number of layers.

  • d_model (int) – Dimensionality of the representation.

  • nhead (int) – Number of attention heads.

  • d_ffn (int) – Dimensionality of the position-wise feed-forward layer.

  • input_shape (tuple) – Shape of input.

  • kdim (int) – Dimension of the key (Optional).

  • vdim (int) – Dimension of the value (Optional).

  • dropout (float) – Dropout rate.

  • activation (str) – Activation function.

  • kernel_size (int) – Kernel size in the conformer encoder.

  • bias (bool) – Whether to use a bias in the convolution part of the conformer encoder.

  • use_positional_encoding (bool) – If true we use a positional encoding.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = SBConformerEncoderBlock(1, 64, 8)
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])
forward(x)[source]

Returns the transformed output.

Parameters

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
class speechbrain.lobes.models.dual_path.SBRNNBlock(input_size, hidden_channels, num_layers, rnn_type='LSTM', dropout=0, bidirectional=True)[source]

Bases: torch.nn.modules.module.Module

RNNBlock for the dual path pipeline.

Parameters
  • input_size (int) – Dimensionality of the input features.

  • hidden_channels (int) – Dimensionality of the latent layer of the rnn.

  • num_layers (int) – Number of the rnn layers.

  • rnn_type (str) – Type of the RNN cell.

  • dropout (float) – Dropout rate.

  • bidirectional (bool) – If True, bidirectional.

Example

>>> x = torch.randn(10, 100, 64)
>>> rnn = SBRNNBlock(64, 100, 1, bidirectional=True)
>>> x = rnn(x)
>>> x.shape
torch.Size([10, 100, 200])
forward(x)[source]

Returns the transformed output.

Parameters

x (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
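
The last dimension in the example grows from 64 to 200 because a bidirectional recurrent layer concatenates its forward and backward hidden states. The sketch below shows the same effect with a plain torch.nn.LSTM; it stands in for the block only to illustrate the shape and is not the block's implementation.

```python
import torch
import torch.nn as nn

x = torch.randn(10, 100, 64)  # [B, L, N]
lstm = nn.LSTM(input_size=64, hidden_size=100, num_layers=1,
               batch_first=True, bidirectional=True)

# The output stacks both directions: 2 * hidden_size features per step.
out, _ = lstm(x)
print(out.shape)  # torch.Size([10, 100, 200])
```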
class speechbrain.lobes.models.dual_path.DPTNetBlock(d_model, nhead, dim_feedforward=256, dropout=0, activation='relu')[source]

Bases: torch.nn.modules.module.Module

The DPT Net block.

Parameters
  • d_model (int) – Number of expected features in the input (required).

  • nhead (int) – Number of heads in the multiheadattention models (required).

  • dim_feedforward (int) – Dimension of the feedforward network model (default=256).

  • dropout (float) – Dropout value (default=0).

  • activation (str) – Activation function of intermediate layer, relu or gelu (default=relu).

Examples

>>> encoder_layer = DPTNetBlock(d_model=512, nhead=8)
>>> src = torch.rand(10, 100, 512)
>>> out = encoder_layer(src)
>>> out.shape
torch.Size([10, 100, 512])
forward(src)[source]

Pass the input through the encoder layer.

Parameters

src (torch.Tensor) – Tensor of shape [B, L, N], where B = batch size, L = time points, N = number of filters.

training: bool
class speechbrain.lobes.models.dual_path.Dual_Computation_Block(intra_mdl, inter_mdl, out_channels, norm='ln', skip_around_intra=True, linear_layer_after_inter_intra=True)[source]

Bases: torch.nn.modules.module.Module

Computation block for dual-path processing.

Parameters

  • intra_mdl (torch.nn.module) – Model to process within the chunks.

  • inter_mdl (torch.nn.module) – Model to process across the chunks.

  • out_channels (int) – Dimensionality of inter/intra model.

  • norm (str) – Normalization type.

  • skip_around_intra (bool) – Skip connection around the intra layer.

  • linear_layer_after_inter_intra (bool) – Linear layer or not after inter or intra.

Example

>>> intra_block = SBTransformerBlock(1, 64, 8)
>>> inter_block = SBTransformerBlock(1, 64, 8)
>>> dual_comp_block = Dual_Computation_Block(intra_block, inter_block, 64)
>>> x = torch.randn(10, 64, 100, 10)
>>> x = dual_comp_block(x)
>>> x.shape
torch.Size([10, 64, 100, 10])
forward(x)[source]

Returns the output tensor.

Parameters

x (torch.Tensor) – Input tensor of dimension [B, N, K, S].

Returns

out – Output tensor of dimension [B, N, K, S], where B = batch size, N = number of filters, K = time points in each chunk, S = number of chunks.

Return type

torch.Tensor

training: bool
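
The intra and inter models take [B, L, N] sequences, while the block receives a [B, N, K, S] tensor. One way to bridge the two, sketched here as an assumption about the general technique rather than a copy of the block's code, is to fold the chunk axis S into the batch for the intra pass and the within-chunk axis K into the batch for the inter pass. Identity placeholders stand in for the two models, and normalization and skip connections are left out.

```python
import torch

B, N, K, S = 10, 64, 100, 10
x = torch.randn(B, N, K, S)

# Intra pass: every chunk becomes its own sequence of K steps.
intra_in = x.permute(0, 3, 2, 1).reshape(B * S, K, N)        # [B*S, K, N]
intra_out = intra_in                                         # placeholder for intra_mdl(intra_in)
x_intra = intra_out.reshape(B, S, K, N).permute(0, 3, 2, 1)  # [B, N, K, S]

# Inter pass: every within-chunk position becomes a sequence of S steps.
inter_in = x_intra.permute(0, 2, 3, 1).reshape(B * K, S, N)  # [B*K, S, N]
inter_out = inter_in                                         # placeholder for inter_mdl(inter_in)
x_inter = inter_out.reshape(B, K, S, N).permute(0, 3, 1, 2)  # [B, N, K, S]
```

With identity placeholders the round trip returns the input unchanged, which is a quick check that the permutations are mutually consistent.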
class speechbrain.lobes.models.dual_path.Dual_Path_Model(in_channels, out_channels, intra_model, inter_model, num_layers=1, norm='ln', K=200, num_spks=2, skip_around_intra=True, linear_layer_after_inter_intra=True, use_global_pos_enc=False, max_length=20000)[source]

Bases: torch.nn.modules.module.Module

The dual path model which is the basis for dualpathrnn, sepformer, dptnet.

Parameters
  • in_channels (int) – Number of channels at the output of the encoder.

  • out_channels (int) – Number of channels that would be inputted to the intra and inter blocks.

  • intra_model (torch.nn.module) – Model to process within the chunks.

  • inter_model (torch.nn.module) – Model to process across the chunks.

  • num_layers (int) – Number of layers of Dual Computation Block.

  • norm (str) – Normalization type.

  • K (int) – Chunk length.

  • num_spks (int) – Number of sources (speakers).

  • skip_around_intra (bool) – Skip connection around intra.

  • linear_layer_after_inter_intra (bool) – Linear layer after inter and intra.

  • use_global_pos_enc (bool) – Global positional encodings.

  • max_length (int) – Maximum sequence length.

Example

>>> intra_block = SBTransformerBlock(1, 64, 8)
>>> inter_block = SBTransformerBlock(1, 64, 8)
>>> dual_path_model = Dual_Path_Model(64, 64, intra_block, inter_block, num_spks=2)
>>> x = torch.randn(10, 64, 2000)
>>> x = dual_path_model(x)
>>> x.shape
torch.Size([2, 10, 64, 2000])
forward(x)[source]

Returns the output tensor.

Parameters

x (torch.Tensor) – Input tensor of dimension [B, N, L].

Returns

out – Output tensor of dimension [spks, B, N, L], where spks = number of speakers, B = batch size, N = number of filters, L = number of time points.

Return type

torch.Tensor

training: bool
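
The parameter K is the chunk length used to turn a [B, N, L] sequence into the [B, N, K, S] layout that the dual computation blocks process. The sketch below assumes half-overlapping chunks (hop of K // 2) and zero-padding on the right only; the padding the model actually applies may differ, and segment is a hypothetical helper name.

```python
import torch
import torch.nn.functional as F

def segment(x, K):
    """Split [B, N, L] into half-overlapping chunks of length K -> [B, N, K, S]."""
    hop = K // 2
    L = x.shape[-1]
    # Zero-pad on the right so that the last chunk is complete.
    pad = (K - L) if L < K else (-(L - K)) % hop
    x = F.pad(x, (0, pad))
    chunks = x.unfold(2, K, hop)   # [B, N, S, K]
    return chunks.transpose(2, 3)  # [B, N, K, S]

x = torch.randn(10, 64, 2000)
chunks = segment(x, K=200)
print(chunks.shape)  # torch.Size([10, 64, 200, 19])
```

Under these assumptions a 2000-step sequence with K=200 gives 19 chunks, and the first half of each chunk repeats the second half of the previous one.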
class speechbrain.lobes.models.dual_path.SepformerWrapper(encoder_kernel_size=16, encoder_in_nchannels=1, encoder_out_nchannels=256, masknet_chunksize=250, masknet_numlayers=2, masknet_norm='ln', masknet_useextralinearlayer=False, masknet_extraskipconnection=True, masknet_numspks=2, intra_numlayers=8, inter_numlayers=8, intra_nhead=8, inter_nhead=8, intra_dffn=1024, inter_dffn=1024, intra_use_positional=True, inter_use_positional=True, intra_norm_before=True, inter_norm_before=True)[source]

Bases: torch.nn.modules.module.Module

The wrapper for the sepformer model, which combines the encoder, masknet, and decoder. See https://arxiv.org/abs/2010.13154

Parameters
  • encoder_kernel_size (int) – The kernel size used in the encoder.

  • encoder_in_nchannels (int) – The number of channels of the input audio.

  • encoder_out_nchannels (int) – The number of filters used in the encoder. Also, number of channels that would be inputted to the intra and inter blocks.

  • masknet_chunksize (int) – The chunk length that is to be processed by the intra blocks.

  • masknet_numlayers (int) – The number of layers of combination of inter and intra blocks.

  • masknet_norm (str) – The normalization type to be used in the masknet. Should be one of 'ln' (layernorm), 'gln' (globallayernorm), 'cln' (cumulative layernorm), or 'bn' (batchnorm). See the select_norm function above for more details.

  • masknet_useextralinearlayer (bool) – Whether or not to use a linear layer at the output of intra and inter blocks.

  • masknet_extraskipconnection (bool) – This introduces extra skip connections around the intra block.

  • masknet_numspks (int) – This determines the number of speakers to estimate.

  • intra_numlayers (int) – This determines the number of layers in the intra block.

  • inter_numlayers (int) – This determines the number of layers in the inter block.

  • intra_nhead (int) – This determines the number of parallel attention heads in the intra block.

  • inter_nhead (int) – This determines the number of parallel attention heads in the inter block.

  • intra_dffn (int) – The number of dimensions in the position-wise feedforward model in the intra block.

  • inter_dffn (int) – The number of dimensions in the position-wise feedforward model in the inter block.

  • intra_use_positional (bool) – Whether or not to use positional encodings in the intra block.

  • inter_use_positional (bool) – Whether or not to use positional encodings in the inter block.

  • intra_norm_before (bool) – Whether or not we use normalization before the transformations in the intra block

  • inter_norm_before (bool) – Whether or not we use normalization before the transformations in the inter block

Example

>>> model = SepformerWrapper()
>>> inp = torch.rand(1, 160)
>>> result = model.forward(inp)
>>> result.shape
torch.Size([1, 160, 2])
training: bool
reset_layer_recursively(layer)[source]

Reinitializes the parameters of the network

forward(mix)[source]
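
The wrapper is described as combining an encoder, a masknet, and a decoder, and its example maps a [1, 160] mixture to a [1, 160, 2] output. The sketch below is a rough illustration of such a pipeline built from plain PyTorch layers. The ReLU after the encoder, the stride of half the kernel size, and the toy_masknet stand-in are all assumptions made for illustration; none of this is the wrapper's actual code.

```python
import torch
import torch.nn as nn

num_spks, n_filters, kernel = 2, 256, 16

# Assumed layout: strided conv encoder and a mirrored transposed-conv decoder.
encoder = nn.Conv1d(1, n_filters, kernel, stride=kernel // 2, bias=False)
decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=kernel // 2, bias=False)

def toy_masknet(h):
    """Hypothetical stand-in for the mask estimator: [B, N, T] -> [spks, B, N, T]."""
    return torch.sigmoid(torch.stack([h, -h]))

mix = torch.rand(1, 160)                    # [B, L]
h = torch.relu(encoder(mix.unsqueeze(1)))   # [B, N, T] = [1, 256, 19]
masks = toy_masknet(h)                      # [spks, B, N, T]
masked = h.unsqueeze(0) * masks             # one masked representation per speaker

# Decode each speaker separately and stack the waveforms on the last axis.
est = torch.stack(
    [decoder(masked[i]).squeeze(1) for i in range(num_spks)], dim=-1
)
print(est.shape)  # torch.Size([1, 160, 2])
```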