speechbrain.lobes.models.resepformer module

Library for the Reseource-Efficient Sepformer.

Authors

Cem Subakan 2022

Summary

Classes:

`MemLSTM`	the Mem-LSTM of SkiM -- Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compability with SpeechBrain.
`ResourceEfficientSeparationPipeline`	Resource Efficient Separation Pipeline Used for RE-SepFormer and SkiM
`ResourceEfficientSeparator`	Resource Efficient Source Separator This is the class that implements RE-SepFormer
`SBRNNBlock`	RNNBlock with output layer.
`SBTransformerBlock_wnormandskip`	A wrapper for the SpeechBrain implementation of the transformer encoder.
`SegLSTM`	the Segment-LSTM of SkiM Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compatibility with SpeechBrain.

Reference

class speechbrain.lobes.models.resepformer.MemLSTM(hidden_size, dropout=0.0, bidirectional=False, mem_type='hc', norm_type='cln')[source]

Bases: Module

the Mem-LSTM of SkiM – Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compability with SpeechBrain.

Arguments:

hidden_size: int,: Dimension of the hidden state.
dropout: float,: dropout ratio. Default is 0.
bidirectional: bool,: Whether the LSTM layers are bidirectional. Default is False.
mem_type: ‘hc’, ‘h’, ‘c’, or ‘id’.: This controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned.
norm_type: ‘gln’, ‘cln’: This selects the type of normalization cln is for causal implemention

Example

>>> x = (torch.randn(1, 5, 64), torch.randn(1, 5, 64))
>>> block = MemLSTM(64)
>>> x = block(x, 5)
>>> x[0].shape
torch.Size([1, 5, 64])

forward(hc, S)[source]

The forward function for the memory RNN

Parameters:

hc (torch.Tensor) –
(h, c), tuple of hidden and cell states from SegLSTM

shape of h and c: (d, B*S, H)

where d is the number of directions
B is the batchsize S is the number chunks H is the latent dimensionality
S (int) – S is the number of chunks

training: bool

class speechbrain.lobes.models.resepformer.SegLSTM(input_size, hidden_size, dropout=0.0, bidirectional=False, norm_type='cLN')[source]

Bases: Module

the Segment-LSTM of SkiM Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compatibility with SpeechBrain.

Arguments:

input_size: int,: dimension of the input feature. The input should have shape (batch, seq_len, input_size).
hidden_size: int,: dimension of the hidden state.
dropout: float,: dropout ratio. Default is 0.
bidirectional: bool,: whether the LSTM layers are bidirectional. Default is False.
norm_type: gln, cln.: This selects the type of normalization cln is for causal implementation.

Example

>>> x = torch.randn(3, 20, 64)
>>> hc = None
>>> seglstm = SegLSTM(64, 64)
>>> y = seglstm(x, hc)
>>> y[0].shape
torch.Size([3, 20, 64])

forward(input, hc)[source]

The forward function of the Segment LSTM

Parameters:

input (torch.Tensor of size [B*S, T, H]) –

where B is the batchsize: S is the number of chunks T is the chunks size H is the latent dimensionality

(h, c), tuple of hidden and cell states from SegLSTM

shape of h and c: (d, B*S, H)

where d is the number of directions: B is the batchsize S is the number chunks H is the latent dimensionality

training: bool

class speechbrain.lobes.models.resepformer.SBRNNBlock(input_size, hidden_channels, num_layers, outsize, rnn_type='LSTM', dropout=0, bidirectional=True)[source]

Bases: Module

RNNBlock with output layer.

Parameters:

input_size (int) – Dimensionality of the input features.
hidden_channels (int) – Dimensionality of the latent layer of the rnn.
num_layers (int) – Number of the rnn layers.
out_size (int) – Number of dimensions at the output of the linear layer
rnn_type (str) – Type of the the rnn cell.
dropout (float) – Dropout rate
bidirectional (bool) – If True, bidirectional.

Example

>>> x = torch.randn(10, 100, 64)
>>> rnn = SBRNNBlock(64, 100, 1, 128, bidirectional=True)
>>> x = rnn(x)
>>> x.shape
torch.Size([10, 100, 128])

forward(x)[source]

Returns the transformed output.

Parameters:

x (torch.Tensor) –

[B, L, N] where, B = Batchsize,

N = number of filters L = time points

training: bool

class speechbrain.lobes.models.resepformer.SBTransformerBlock_wnormandskip(num_layers, d_model, nhead, d_ffn=2048, input_shape=None, kdim=None, vdim=None, dropout=0.1, activation='relu', use_positional_encoding=False, norm_before=False, attention_type='regularMHA', causal=False, use_norm=True, use_skip=True, norm_type='gln')[source]

Bases: Module

A wrapper for the SpeechBrain implementation of the transformer encoder.

Parameters:

num_layers (int) – Number of layers.
d_model (int) – Dimensionality of the representation.
nhead (int) – Number of attention heads.
d_ffn (int) – Dimensionality of positional feed forward.
input_shape (tuple) – Shape of input.
kdim (int) – Dimension of the key (Optional).
vdim (int) – Dimension of the value (Optional).
dropout (float) – Dropout rate.
activation (str) – Activation function.
use_positional_encoding (bool) – If true we use a positional encoding.
norm_before (bool) – Use normalization before transformations.

Example

>>> x = torch.randn(10, 100, 64)
>>> block = SBTransformerBlock_wnormandskip(1, 64, 8)
>>> x = block(x)
>>> x.shape
torch.Size([10, 100, 64])

forward(x)[source]

Returns the transformed output.

Parameters:

x (torch.Tensor) –

Tensor shape [B, L, N], where, B = Batchsize,

L = time points N = number of filters

training: bool

class speechbrain.lobes.models.resepformer.ResourceEfficientSeparationPipeline(input_size, hidden_size, output_size, dropout=0.0, num_blocks=2, segment_size=20, bidirectional=True, mem_type='av', norm_type='gln', seg_model=None, mem_model=None)[source]

Bases: Module

Resource Efficient Separation Pipeline Used for RE-SepFormer and SkiM

Note: This implementation is a generalization of the ESPNET implementation of SkiM

Arguments:

input_size: int,: Dimension of the input feature. Input shape shoud be (batch, length, input_size)
hidden_size: int,: Dimension of the hidden state.
output_size: int,: Dimension of the output size.
dropout: float,: Dropout ratio. Default is 0.
num_blocks: int: Number of basic SkiM blocks
segment_size: int: Segmentation size for splitting long features
bidirectional: bool,: Whether the RNN layers are bidirectional.
mem_type: ‘hc’, ‘h’, ‘c’, ‘id’ or None.: This controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.
norm_type: gln, cln.: cln is for causal implementation.
seg_model: class: The model that processes the within segment elements
mem_model: class: The memory model that ensures continuity between the segments

Example

>>> x = torch.randn(10, 100, 64)
>>> seg_mdl = SBTransformerBlock_wnormandskip(1, 64, 8)
>>> mem_mdl = SBTransformerBlock_wnormandskip(1, 64, 8)
>>> resepf_pipeline = ResourceEfficientSeparationPipeline(64, 64, 128, seg_model=seg_mdl, mem_model=mem_mdl)
>>> out = resepf_pipeline.forward(x)
>>> out.shape
torch.Size([10, 100, 128])

forward(input)[source]

The forward function of the ResourceEfficientSeparatioPipeline

This takes in a tensor of size [B, (S*K), D]

Parameters:

input (torch.Tensor) –

Tensor shape [B, (S*K), D], where, B = Batchsize,

S = Number of chunks K = Chunksize D = number of features

training: bool

class speechbrain.lobes.models.resepformer.ResourceEfficientSeparator(input_dim: int, causal: bool = True, num_spk: int = 2, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0, mem_type: str = 'hc', seg_model=None, mem_model=None)[source]

Bases: Module

Resource Efficient Source Separator This is the class that implements RE-SepFormer

Arguments:

input_dim: int,: Input feature dimension
causal: bool,: Whether the system is causal.
num_spk: int,: Number of target speakers.
nonlinear: class: the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
layer: int,: number of blocks. Default is 2 for RE-SepFormer.
unit: int,: Dimensionality of the hidden state.
segment_size: int,: Chunk size for splitting long features
dropout: float,: dropout ratio. Default is 0.
mem_type: ‘hc’, ‘h’, ‘c’, ‘id’, ‘av’ or None.: This controls whether a memory representation will be used to ensure continuity between segments. In ‘av’ mode, the summary state is is calculated by simply averaging over the time dimension of each segment In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the memory model will be removed.
seg_model: class,: The model that processes the within segment elements
mem_model: class,: The memory model that ensures continuity between the segments

Example

>>> x = torch.randn(10, 64, 100)
>>> seg_mdl = SBTransformerBlock_wnormandskip(1, 64, 8)
>>> mem_mdl = SBTransformerBlock_wnormandskip(1, 64, 8)
>>> resepformer = ResourceEfficientSeparator(64, num_spk=3, mem_type='av', seg_model=seg_mdl, mem_model=mem_mdl)
>>> out = resepformer.forward(x)
>>> out.shape
torch.Size([3, 10, 64, 100])

forward(inpt: Tensor)[source]

Forward. Arguments: ———-

inpt (torch.Tensor):
Encoded feature [B, T, N]

training: bool