speechbrain.lobes.models.resepformer module
Library for the Reseource-Efficient Sepformer.
- Authors
Cem Subakan 2022
Summary
Classes:
the Mem-LSTM of SkiM -- Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compability with SpeechBrain. |
|
Resource Efficient Separation Pipeline Used for RE-SepFormer and SkiM |
|
Resource Efficient Source Separator This is the class that implements RE-SepFormer |
|
RNNBlock with output layer. |
|
A wrapper for the SpeechBrain implementation of the transformer encoder. |
|
the Segment-LSTM of SkiM Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compatibility with SpeechBrain. |
Reference
- class speechbrain.lobes.models.resepformer.MemLSTM(hidden_size, dropout=0.0, bidirectional=False, mem_type='hc', norm_type='cln')[source]
Bases:
Module
the Mem-LSTM of SkiM – Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compability with SpeechBrain.
- hidden_size: int,
Dimension of the hidden state.
- dropout: float,
dropout ratio. Default is 0.
- bidirectional: bool,
Whether the LSTM layers are bidirectional. Default is False.
- mem_type: ‘hc’, ‘h’, ‘c’, or ‘id’.
This controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned.
- norm_type: ‘gln’, ‘cln’
This selects the type of normalization cln is for causal implemention
Example
>>> x = (torch.randn(1, 5, 64), torch.randn(1, 5, 64)) >>> block = MemLSTM(64) >>> x = block(x, 5) >>> x[0].shape torch.Size([1, 5, 64])
- forward(hc, S)[source]
The forward function for the memory RNN
- Parameters
hc (torch.Tensor) –
(h, c), tuple of hidden and cell states from SegLSTM
- shape of h and c: (d, B*S, H)
- where d is the number of directions
B is the batchsize S is the number chunks H is the latent dimensionality
S (int) – S is the number of chunks
- class speechbrain.lobes.models.resepformer.SegLSTM(input_size, hidden_size, dropout=0.0, bidirectional=False, norm_type='cLN')[source]
Bases:
Module
the Segment-LSTM of SkiM Note: This is taken from the SkiM implementation in ESPNet toolkit and modified for compatibility with SpeechBrain.
- input_size: int,
dimension of the input feature. The input should have shape (batch, seq_len, input_size).
- hidden_size: int,
dimension of the hidden state.
- dropout: float,
dropout ratio. Default is 0.
- bidirectional: bool,
whether the LSTM layers are bidirectional. Default is False.
- norm_type: gln, cln.
This selects the type of normalization cln is for causal implementation.
Example
>>> x = torch.randn(3, 20, 64) >>> hc = None >>> seglstm = SegLSTM(64, 64) >>> y = seglstm(x, hc) >>> y[0].shape torch.Size([3, 20, 64])
- forward(input, hc)[source]
The forward function of the Segment LSTM
- Parameters
input (torch.Tensor of size [B*S, T, H]) –
- where B is the batchsize
S is the number of chunks T is the chunks size H is the latent dimensionality
(h, c), tuple of hidden and cell states from SegLSTM
- shape of h and c: (d, B*S, H)
- where d is the number of directions
B is the batchsize S is the number chunks H is the latent dimensionality
- class speechbrain.lobes.models.resepformer.SBRNNBlock(input_size, hidden_channels, num_layers, outsize, rnn_type='LSTM', dropout=0, bidirectional=True)[source]
Bases:
Module
RNNBlock with output layer.
- Parameters
input_size (int) – Dimensionality of the input features.
hidden_channels (int) – Dimensionality of the latent layer of the rnn.
num_layers (int) – Number of the rnn layers.
out_size (int) – Number of dimensions at the output of the linear layer
rnn_type (str) – Type of the the rnn cell.
dropout (float) – Dropout rate
bidirectional (bool) – If True, bidirectional.
Example
>>> x = torch.randn(10, 100, 64) >>> rnn = SBRNNBlock(64, 100, 1, 128, bidirectional=True) >>> x = rnn(x) >>> x.shape torch.Size([10, 100, 128])
- forward(x)[source]
Returns the transformed output.
- Parameters
x (torch.Tensor) –
[B, L, N] where, B = Batchsize,
N = number of filters L = time points
- class speechbrain.lobes.models.resepformer.SBTransformerBlock_wnormandskip(num_layers, d_model, nhead, d_ffn=2048, input_shape=None, kdim=None, vdim=None, dropout=0.1, activation='relu', use_positional_encoding=False, norm_before=False, attention_type='regularMHA', causal=False, use_norm=True, use_skip=True, norm_type='gln')[source]
Bases:
Module
A wrapper for the SpeechBrain implementation of the transformer encoder.
- Parameters
num_layers (int) – Number of layers.
d_model (int) – Dimensionality of the representation.
nhead (int) – Number of attention heads.
d_ffn (int) – Dimensionality of positional feed forward.
input_shape (tuple) – Shape of input.
kdim (int) – Dimension of the key (Optional).
vdim (int) – Dimension of the value (Optional).
dropout (float) – Dropout rate.
activation (str) – Activation function.
use_positional_encoding (bool) – If true we use a positional encoding.
norm_before (bool) – Use normalization before transformations.
Example
>>> x = torch.randn(10, 100, 64) >>> block = SBTransformerBlock_wnormandskip(1, 64, 8) >>> x = block(x) >>> x.shape torch.Size([10, 100, 64])
- forward(x)[source]
Returns the transformed output.
- Parameters
x (torch.Tensor) –
Tensor shape [B, L, N], where, B = Batchsize,
L = time points N = number of filters
- class speechbrain.lobes.models.resepformer.ResourceEfficientSeparationPipeline(input_size, hidden_size, output_size, dropout=0.0, num_blocks=2, segment_size=20, bidirectional=True, mem_type='av', norm_type='gln', seg_model=None, mem_model=None)[source]
Bases:
Module
Resource Efficient Separation Pipeline Used for RE-SepFormer and SkiM
Note: This implementation is a generalization of the ESPNET implementation of SkiM
- input_size: int,
Dimension of the input feature. Input shape shoud be (batch, length, input_size)
- hidden_size: int,
Dimension of the hidden state.
- output_size: int,
Dimension of the output size.
- dropout: float,
Dropout ratio. Default is 0.
- num_blocks: int
Number of basic SkiM blocks
- segment_size: int
Segmentation size for splitting long features
- bidirectional: bool,
Whether the RNN layers are bidirectional.
- mem_type: ‘hc’, ‘h’, ‘c’, ‘id’ or None.
This controls whether the hidden (or cell) state of SegLSTM will be processed by MemLSTM. In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the MemLSTM will be removed.
- norm_type: gln, cln.
cln is for causal implementation.
- seg_model: class
The model that processes the within segment elements
- mem_model: class
The memory model that ensures continuity between the segments
Example
>>> x = torch.randn(10, 100, 64) >>> seg_mdl = SBTransformerBlock_wnormandskip(1, 64, 8) >>> mem_mdl = SBTransformerBlock_wnormandskip(1, 64, 8) >>> resepf_pipeline = ResourceEfficientSeparationPipeline(64, 64, 128, seg_model=seg_mdl, mem_model=mem_mdl) >>> out = resepf_pipeline.forward(x) >>> out.shape torch.Size([10, 100, 128])
- forward(input)[source]
The forward function of the ResourceEfficientSeparatioPipeline
This takes in a tensor of size [B, (S*K), D]
- Parameters
input (torch.Tensor) –
Tensor shape [B, (S*K), D], where, B = Batchsize,
S = Number of chunks K = Chunksize D = number of features
- class speechbrain.lobes.models.resepformer.ResourceEfficientSeparator(input_dim: int, causal: bool = True, num_spk: int = 2, nonlinear: str = 'relu', layer: int = 3, unit: int = 512, segment_size: int = 20, dropout: float = 0.0, mem_type: str = 'hc', seg_model=None, mem_model=None)[source]
Bases:
Module
Resource Efficient Source Separator This is the class that implements RE-SepFormer
- input_dim: int,
Input feature dimension
- causal: bool,
Whether the system is causal.
- num_spk: int,
Number of target speakers.
- nonlinear: class
the nonlinear function for mask estimation, select from ‘relu’, ‘tanh’, ‘sigmoid’
- layer: int,
number of blocks. Default is 2 for RE-SepFormer.
- unit: int,
Dimensionality of the hidden state.
- segment_size: int,
Chunk size for splitting long features
- dropout: float,
dropout ratio. Default is 0.
- mem_type: ‘hc’, ‘h’, ‘c’, ‘id’, ‘av’ or None.
This controls whether a memory representation will be used to ensure continuity between segments. In ‘av’ mode, the summary state is is calculated by simply averaging over the time dimension of each segment In ‘id’ mode, both the hidden and cell states will be identically returned. When mem_type is None, the memory model will be removed.
- seg_model: class,
The model that processes the within segment elements
- mem_model: class,
The memory model that ensures continuity between the segments
Example
>>> x = torch.randn(10, 64, 100) >>> seg_mdl = SBTransformerBlock_wnormandskip(1, 64, 8) >>> mem_mdl = SBTransformerBlock_wnormandskip(1, 64, 8) >>> resepformer = ResourceEfficientSeparator(64, num_spk=3, mem_type='av', seg_model=seg_mdl, mem_model=mem_mdl) >>> out = resepformer.forward(x) >>> out.shape torch.Size([3, 10, 64, 100])