speechbrain.lobes.models.conv_tasnet module

Implementation of Conv-TasNet, a popular speech separation model.

Summary

Classes:

`ChannelwiseLayerNorm`	Channel-wise Layer Normalization (cLN).
`Chomp1d`	This class cuts out a portion of the signal from the end.
`Decoder`	This class implements the decoder for the ConvTasnet.
`DepthwiseSeparableConv`	Building block for the Temporal Blocks of Masknet in ConvTasNet.
`Encoder`	This class learns the adaptive frontend for the ConvTasnet model.
`GlobalLayerNorm`	Global Layer Normalization (gLN).
`MaskNet`
`TemporalBlock`	The conv1d compound layers used in Masknet.
`TemporalBlocksSequential`	A wrapper that replicates the temporal-block layer.

Functions:

`choose_norm`	This function returns the chosen normalization type.

Reference

class speechbrain.lobes.models.conv_tasnet.Encoder(L, N)[source]

Bases: Module

This class learns the adaptive frontend for the ConvTasnet model.

Parameters:

L (int) – The filter kernel size. Needs to be an odd number.
N (int) – Number of dimensions at the output of the adaptive front end.

Example

>>> inp = torch.rand(10, 100)
>>> encoder = Encoder(11, 20)
>>> h = encoder(inp)
>>> h.shape
torch.Size([10, 20, 20])

forward(mixture)[source]

Parameters: mixture (Tensor) – Tensor shape is [M, T]. M is batch size. T is the number of samples.
Returns: mixture_w – Tensor shape is [M, K, N], where K = (T-L)/(L/2)+1 = 2T/L-1.
Return type: Tensor
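
The frame count K can be checked with plain arithmetic. The sketch below assumes a hop of L // 2 (50% overlap). For the "same" case it further assumes padding that yields ceil(T / hop) frames; that padding behavior is inferred from the example output above ([10, 20, 20] for T = 100, L = 11) and is not stated in this reference. `num_frames` is a hypothetical helper, not part of the module.

```python
import math


def num_frames(T, L, padding="same"):
    """Number of encoder frames K for T samples and kernel size L.

    The hop (stride) is assumed to be L // 2, i.e. 50% overlap.
    """
    hop = L // 2
    if padding == "valid":
        # No padding: only windows that fit entirely inside the signal.
        return (T - L) // hop + 1
    # "same" padding (assumed): one frame per hop position.
    return math.ceil(T / hop)


print(num_frames(100, 11, "same"))   # 20, the K seen in the Encoder example
print(num_frames(100, 11, "valid"))  # 18
print(num_frames(100, 10, "valid"))  # 19, equal to 2 * 100 / 10 - 1
```

With an even L and no padding, the count matches the closed form 2T/L-1 given for the return value.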

class speechbrain.lobes.models.conv_tasnet.Decoder(L, N)[source]

Bases: Module

This class implements the decoder for the ConvTasnet.

The separated source embeddings are fed to the decoder to reconstruct the estimated sources in the time domain.

Parameters: L (int) – Number of bases to use when reconstructing.

Example

>>> L, C, N = 8, 2, 8
>>> mixture_w = torch.randn(10, 100, N)
>>> est_mask = torch.randn(10, 100, C, N)
>>> decoder = Decoder(L, N)
>>> mixture_hat = decoder(mixture_w, est_mask)
>>> mixture_hat.shape
torch.Size([10, 404, 2])

forward(mixture_w, est_mask)[source]

Parameters:

mixture_w (Tensor) – Tensor shape is [M, K, N].
est_mask (Tensor) – Tensor shape is [M, K, C, N].

Returns:

est_source – Tensor shape is [M, T, C].

Return type:

Tensor
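
The output length T in the example (404 for K = 100 frames and L = 8) is what overlap-add with a hop of L // 2 produces. The sketch below is a pure-Python illustration under that assumption; `overlap_add` is a hypothetical helper and may differ from the module's own reconstruction code.

```python
def overlap_add(frames, hop):
    """Sum frames of equal length into one signal, shifting each by `hop`."""
    frame_len = len(frames[0])
    out = [0.0] * ((len(frames) - 1) * hop + frame_len)
    for k, frame in enumerate(frames):
        start = k * hop
        for i, value in enumerate(frame):
            out[start + i] += value
    return out


L, K = 8, 100
frames = [[1.0] * L for _ in range(K)]  # K frames of L samples each
signal = overlap_add(frames, hop=L // 2)
print(len(signal))  # 404, the T in the Decoder example
print(signal[:6])   # [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
```

Samples covered by two frames sum to 2.0, which shows the 50% overlap.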

class speechbrain.lobes.models.conv_tasnet.TemporalBlocksSequential(input_shape, H, P, R, X, norm_type, causal)[source]

Bases: Sequential

A wrapper that replicates the temporal-block layer.

Parameters:

input_shape (tuple) – Expected shape of the input.
H (int) – The number of intermediate channels.
P (int) – The kernel size in the convolutions.
R (int) – The number of times to replicate the multilayer Temporal Blocks.
X (int) – The number of layers of Temporal Blocks with different dilations.
norm_type (str) – The type of normalization, in [‘gLN’, ‘cLN’].
causal (bool) – To use causal or non-causal convolutions, in [True, False].

Example

>>> x = torch.randn(14, 100, 10)
>>> H, P, R, X = 10, 5, 2, 3
>>> TemporalBlocks = TemporalBlocksSequential(
...     x.shape, H, P, R, X, 'gLN', False
... )
>>> y = TemporalBlocks(x)
>>> y.shape
torch.Size([14, 100, 10])
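
To see how R and X shape the stack, the sketch below lists the dilation of each block and the resulting receptive field. It assumes dilations grow as 2**x within each repeat, as in the Conv-TasNet paper; this reference only says the layers use "different dilations", so treat the schedule as an assumption. Both helpers are hypothetical.

```python
def dilation_schedule(R, X):
    """Dilation of every temporal block, assuming 2**x growth in each repeat."""
    return [2 ** x for _ in range(R) for x in range(X)]


def receptive_field(P, R, X):
    """Frames seen by one output frame: each block adds (P - 1) * dilation."""
    return 1 + sum((P - 1) * d for d in dilation_schedule(R, X))


P, R, X = 5, 2, 3  # values from the TemporalBlocksSequential example
print(dilation_schedule(R, X))   # [1, 2, 4, 1, 2, 4]
print(receptive_field(P, R, X))  # 57
```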

class speechbrain.lobes.models.conv_tasnet.MaskNet(N, B, H, P, X, R, C, norm_type='gLN', causal=False, mask_nonlinear='relu')[source]

Bases: Module

Parameters:

N (int) – Number of filters in autoencoder.
B (int) – Number of channels in bottleneck 1 × 1-conv block.
H (int) – Number of channels in convolutional blocks.
P (int) – Kernel size in convolutional blocks.
X (int) – Number of convolutional blocks in each repeat.
R (int) – Number of repeats.
C (int) – Number of speakers.
norm_type (str) – One of BN, gLN, cLN.
causal (bool) – Causal or non-causal.
mask_nonlinear (str) – Non-linear function used to generate the mask, in [‘softmax’, ‘relu’].

Example

>>> N, B, H, P, X, R, C = 11, 12, 2, 5, 3, 1, 2
>>> masknet = MaskNet(N, B, H, P, X, R, C)
>>> mixture_w = torch.randn(10, 11, 100)
>>> est_mask = masknet(mixture_w)
>>> est_mask.shape
torch.Size([2, 10, 11, 100])

forward(mixture_w)[source]

Keeps the same API as TasNet.

Parameters: mixture_w (Tensor) – Tensor shape is [M, K, N], M is batch size.
Returns: est_mask – Tensor shape is [M, K, C, N].
Return type: Tensor
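
The two `mask_nonlinear` options behave differently, which the pure-Python sketch below illustrates for a single time-frequency cell. It assumes the softmax is taken across the C speakers, which is one common convention; the axis the module actually uses is not stated here. The helper names and scores are hypothetical.

```python
import math


def relu_masks(scores):
    """Clamp each speaker's score at zero; masks are independent."""
    return [max(0.0, s) for s in scores]


def softmax_masks(scores):
    """Normalize scores so the masks are positive and sum to one."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.5, -0.5]  # hypothetical scores for C = 2 speakers at one cell
print(relu_masks(scores))     # [1.5, 0.0]
print(softmax_masks(scores))  # two positive values that sum to 1
```

With 'relu' the masks are unbounded and need not sum to one, while 'softmax' forces the speakers to share each cell.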

class speechbrain.lobes.models.conv_tasnet.TemporalBlock(input_shape, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: Module

The conv1d compound layers used in Masknet.

Parameters:

input_shape (tuple) – The expected shape of the input.
out_channels (int) – The number of intermediate channels.
kernel_size (int) – The kernel size in the convolutions.
stride (int) – Convolution stride in convolutional layers.
padding (str) – The type of padding in the convolutional layers, (same, valid, causal). If “valid”, no padding is performed.
dilation (int) – Amount of dilation in convolutional layers.
norm_type (str) – The type of normalization, in [‘gLN’, ‘cLN’].
causal (bool) – To use causal or non-causal convolutions, in [True, False].

Example

>>> x = torch.randn(14, 100, 10)
>>> temporal_block = TemporalBlock(x.shape, 10, 11, 1, 'same', 1)
>>> y = temporal_block(x)
>>> y.shape
torch.Size([14, 100, 10])

forward(x)[source]

Parameters: x (Tensor) – Tensor shape is [M, K, B].
Returns: x – Tensor shape is [M, K, B].
Return type: Tensor

class speechbrain.lobes.models.conv_tasnet.DepthwiseSeparableConv(input_shape, out_channels, kernel_size, stride, padding, dilation, norm_type='gLN', causal=False)[source]

Bases: Sequential

Building block for the Temporal Blocks of Masknet in ConvTasNet.

Parameters:

input_shape (tuple) – Expected shape of the input.
out_channels (int) – Number of output channels.
kernel_size (int) – The kernel size in the convolutions.
stride (int) – Convolution stride in convolutional layers.
padding (str) – The type of padding in the convolutional layers, (same, valid, causal). If “valid”, no padding is performed.
dilation (int) – Amount of dilation in convolutional layers.
norm_type (str) – The type of normalization, in [‘gLN’, ‘cLN’].
causal (bool) – To use causal or non-causal convolutions, in [True, False].

Example

>>> x = torch.randn(14, 100, 10)
>>> DSconv = DepthwiseSeparableConv(x.shape, 10, 11, 1, 'same', 1)
>>> y = DSconv(x)
>>> y.shape
torch.Size([14, 100, 10])
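
A depthwise separable convolution is used mainly because it needs far fewer weights than a standard convolution. The sketch below compares the two using the textbook decomposition (a depthwise convolution followed by a 1x1 pointwise convolution). It ignores biases, normalization and activation parameters, so it is not a count of this class's actual parameters. The function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, kernel_size):
    """Weights in an ordinary 1-D convolution (bias ignored)."""
    return c_in * c_out * kernel_size


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Depthwise conv (one filter per input channel) plus 1x1 pointwise conv."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise


# Sizes from the DepthwiseSeparableConv example: 10 channels, kernel size 11.
print(standard_conv_params(10, 10, 11))        # 1100
print(depthwise_separable_params(10, 10, 11))  # 210
```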

class speechbrain.lobes.models.conv_tasnet.Chomp1d(chomp_size)[source]

Bases: Module

This class cuts out a portion of the signal from the end.

It is written as a class to be able to incorporate it inside a sequential wrapper.

Parameters: chomp_size (int) – The size of the portion to discard (in samples).

Example

>>> x = torch.randn(10, 110, 5)
>>> chomp = Chomp1d(10)
>>> x_chomped = chomp(x)
>>> x_chomped.shape
torch.Size([10, 100, 5])

forward(x)[source]

Parameters: x (Tensor) – Tensor shape is [M, Kpad, H].
Returns: x – Tensor shape is [M, K, H].
Return type: Tensor
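
The chomp operation amounts to slicing off the tail of the time axis. Below is a minimal pure-Python sketch on nested lists shaped [M, Kpad, H]; `chomp` is a hypothetical stand-in for illustration, not the module's implementation.

```python
def chomp(batch, chomp_size):
    """Drop the last `chomp_size` time steps from each [Kpad, H] sequence."""
    # Slicing by an explicit end index stays correct when chomp_size == 0,
    # whereas seq[:-0] would return an empty list.
    return [seq[: len(seq) - chomp_size] for seq in batch]


M, Kpad, H = 10, 110, 5
batch = [[[0.0] * H for _ in range(Kpad)] for _ in range(M)]
out = chomp(batch, 10)
print(len(out), len(out[0]), len(out[0][0]))  # 10 100 5
```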

speechbrain.lobes.models.conv_tasnet.choose_norm(norm_type, channel_size)[source]

This function returns the chosen normalization type.

Parameters:

norm_type (str) – One of [‘gLN’, ‘cLN’, ‘batchnorm’].
channel_size (int) – Number of channels.

Example

>>> choose_norm('gLN', 10)
GlobalLayerNorm()

class speechbrain.lobes.models.conv_tasnet.ChannelwiseLayerNorm(channel_size)[source]

Bases: Module

Channel-wise Layer Normalization (cLN).

Parameters: channel_size (int) – Number of channels in the normalization dimension (the third dimension).

Example

>>> x = torch.randn(2, 3, 3)
>>> norm_func = ChannelwiseLayerNorm(3)
>>> x_normalized = norm_func(x)
>>> x_normalized.shape
torch.Size([2, 3, 3])

reset_parameters()[source]: Resets the parameters.

forward(y)[source]

Parameters: y (Tensor) – Tensor shape is [M, K, N]. M is batch size, N is channel size, and K is length.
Returns: cLN_y – Tensor shape is [M, K, N].
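
As a rough illustration of cLN, the sketch below normalizes each frame over its N channels for a single [K, N] item. It uses a scalar gain and bias and an epsilon of 1e-8; these are simplifying assumptions, since the class presumably learns per-channel parameters. `channelwise_layer_norm` is a hypothetical helper.

```python
import math

EPS = 1e-8  # small constant for numerical stability (value is an assumption)


def channelwise_layer_norm(y, gamma=1.0, beta=0.0):
    """Normalize each frame of y ([K, N] nested lists) over its N channels."""
    out = []
    for frame in y:
        mean = sum(frame) / len(frame)
        var = sum((v - mean) ** 2 for v in frame) / len(frame)
        out.append(
            [gamma * (v - mean) / math.sqrt(var + EPS) + beta for v in frame]
        )
    return out


y = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # K = 2 frames, N = 3 channels
for frame in channelwise_layer_norm(y):
    print([round(v, 3) for v in frame])  # each frame: [-1.225, 0.0, 1.225]
```

Both frames normalize to the same values because statistics are computed per frame, so the difference in scale between frames is removed.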

class speechbrain.lobes.models.conv_tasnet.GlobalLayerNorm(channel_size)[source]

Bases: Module

Global Layer Normalization (gLN).

Parameters: channel_size (int) – Number of channels in the third dimension.

Example

>>> x = torch.randn(2, 3, 3)
>>> norm_func = GlobalLayerNorm(3)
>>> x_normalized = norm_func(x)
>>> x_normalized.shape
torch.Size([2, 3, 3])

reset_parameters()[source]: Resets the parameters.

forward(y)[source]

Parameters: y (Tensor) – Tensor shape is [M, K, N]. M is batch size, N is channel size, and K is length.
Returns: gLN_y – Tensor shape is [M, K, N].
Return type: Tensor
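
In contrast to cLN, gLN uses a single mean and variance computed over both the time and channel dimensions of each item. The sketch below shows this for one [K, N] item, again with a scalar gain and bias and an assumed epsilon of 1e-8. `global_layer_norm` is a hypothetical helper, not the class's implementation.

```python
import math

EPS = 1e-8  # small constant for numerical stability (value is an assumption)


def global_layer_norm(y, gamma=1.0, beta=0.0):
    """Normalize y ([K, N] nested lists) with one mean/variance over K and N."""
    values = [v for frame in y for v in frame]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = gamma / math.sqrt(var + EPS)
    return [[scale * (v - mean) + beta for v in frame] for frame in y]


y = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # K = 2 frames, N = 3 channels
out = global_layer_norm(y)
flat = [v for frame in out for v in frame]
print(abs(sum(flat) / len(flat)) < 1e-9)  # True: zero mean over the whole item
print(sum(out[0]) < 0 < sum(out[1]))      # True: frames keep their relative level
```

Because the statistics are shared across frames, a quiet frame stays below a loud one after normalization, which is not the case with cLN.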