speechbrain.nnet.RNN module

Library implementing recurrent neural networks.

Authors
  • Mirco Ravanelli 2020

  • Ju-Chieh Chou 2020

  • Jianyuan Zhong 2020

  • Loren Lugosch 2020

Summary

Classes:

AttentionalRNNDecoder

This function implements an RNN decoder model with attention.

GRU

This function implements a basic GRU.

GRUCell

This class implements a basic GRU Cell for a timestep of input, while GRU() takes the whole sequence as input.

LSTM

This function implements a basic LSTM.

LSTMCell

This class implements a basic LSTM Cell for a timestep of input, while LSTM() takes the whole sequence as input.

LiGRU

This function implements a Light GRU (liGRU).

LiGRU_Layer

This function implements a Light Gated Recurrent Units (liGRU) layer.

QuasiRNN

This is an implementation of the Quasi-RNN.

QuasiRNNLayer

Applies a single layer Quasi-Recurrent Neural Network (QRNN) to an input sequence.

RNN

This function implements a vanilla RNN.

RNNCell

This class implements a basic RNN Cell for a timestep of input, while RNN() takes the whole sequence as input.

Functions:

pack_padded_sequence

Returns packed speechbrain-formatted tensors.

pad_packed_sequence

Returns speechbrain-formatted tensor from packed sequences.

rnn_init

This function is used to initialize the RNN weights.

Reference

speechbrain.nnet.RNN.pack_padded_sequence(inputs, lengths)[source]

Returns packed speechbrain-formatted tensors.

Parameters
  • inputs (torch.Tensor) – The sequences to pack.

  • lengths (torch.Tensor) – The length of each sequence.

speechbrain.nnet.RNN.pad_packed_sequence(inputs)[source]

Returns speechbrain-formatted tensor from packed sequences.

Parameters

inputs (torch.nn.utils.rnn.PackedSequence) – An input set of sequences to convert to a tensor.
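
A minimal round-trip sketch of these two helpers is given below. It assumes that lengths is expressed relative to the padded length (1.0 meaning a full-length sequence), as is the convention elsewhere in SpeechBrain; treat this as an illustration rather than a normative example.

>>> import torch
>>> from speechbrain.nnet.RNN import pack_padded_sequence, pad_packed_sequence
>>> inputs = torch.rand([4, 10, 20])  # (batch, time, fea)
>>> # Assumption: relative lengths, as used by other SpeechBrain modules.
>>> lengths = torch.tensor([1.0, 0.8, 0.6, 0.5])
>>> packed = pack_padded_sequence(inputs, lengths)  # torch PackedSequence
>>> padded = pad_packed_sequence(packed)            # back to (batch, time, fea)
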

class speechbrain.nnet.RNN.RNN(hidden_size, input_shape=None, input_size=None, nonlinearity='relu', num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This function implements a vanilla RNN.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
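
With bidirectional=True the two scan directions are concatenated along the feature dimension, so the output size is expected to be 2*hidden_size (this sketch assumes the standard PyTorch concatenation behaviour of the underlying torch.nn.RNN):

>>> import torch
>>> from speechbrain.nnet.RNN import RNN
>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(
...     hidden_size=5,
...     input_shape=inp_tensor.shape,
...     num_layers=2,
...     bidirectional=True,
... )
>>> out_tensor, _ = net(inp_tensor)
>>> # Expected: out_tensor.shape == torch.Size([4, 10, 10]),
>>> # i.e., 2 * hidden_size when both directions are concatenated.
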
forward(x, hx=None, lengths=None)[source]

Returns the output of the vanilla RNN.

Parameters
  • x (torch.Tensor) – Input tensor.

  • hx (torch.Tensor) – Initial hidden state.

  • lengths (torch.Tensor) – Lengths of the sequences in the batch, used to pack the padded inputs.

training: bool
class speechbrain.nnet.RNN.LSTM(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This function implements a basic LSTM.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LSTM(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
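
When the batch is padded, the relative lengths can be passed to forward() so that padding is packed away internally. A sketch, assuming lengths are fractions of the padded length (the convention used by pack_padded_sequence above):

>>> import torch
>>> from speechbrain.nnet.RNN import LSTM
>>> inp_tensor = torch.rand([4, 10, 20])
>>> lengths = torch.tensor([1.0, 0.9, 0.7, 0.5])  # assumption: relative lengths
>>> net = LSTM(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor, lengths=lengths)
>>> # out_tensor is padded back to the full (batch, time, hidden_size) shape.
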
forward(x, hx=None, lengths=None)[source]

Returns the output of the LSTM.

Parameters
  • x (torch.Tensor) – Input tensor.

  • hx (torch.Tensor) – Initial hidden state.

  • lengths (torch.Tensor) – Lengths of the sequences in the batch, used to pack the padded inputs.

training: bool
class speechbrain.nnet.RNN.GRU(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This function implements a basic GRU.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel) the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = GRU(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
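
A deeper GRU with dropout between the recurrent layers is configured in the same way; the output keeps the (batch, time, hidden_size) shape (a short sketch):

>>> import torch
>>> from speechbrain.nnet.RNN import GRU
>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = GRU(
...     hidden_size=5,
...     input_shape=inp_tensor.shape,
...     num_layers=3,
...     dropout=0.2,
... )
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
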
forward(x, hx=None, lengths=None)[source]

Returns the output of the GRU.

Parameters
  • x (torch.Tensor) – Input tensor.

  • hx (torch.Tensor) – Initial hidden state.

  • lengths (torch.Tensor) – Lengths of the sequences in the batch, used to pack the padded inputs.

training: bool
class speechbrain.nnet.RNN.RNNCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, nonlinearity='tanh')[source]

Bases: torch.nn.modules.module.Module

This class implements a basic RNN Cell for a timestep of input, while RNN() takes the whole sequence as input.

It is designed for an autoregressive decoder (e.g., an attentional decoder), which takes one input at a time. It uses torch.nn.RNNCell() instead of torch.nn.RNN() to reduce VRAM consumption.

It accepts input tensors formatted as (batch, fea).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

Example

>>> inp_tensor = torch.rand([4, 20])
>>> net = RNNCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
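
Because the cell consumes one timestep at a time, it is typically driven by an explicit loop inside an autoregressive decoder. A minimal sketch (hx=None lets the cell create its initial hidden state; the returned state is fed straight back in at the next step):

>>> import torch
>>> from speechbrain.nnet.RNN import RNNCell
>>> inp_seq = torch.rand([4, 6, 20])  # (batch, time, fea)
>>> cell = RNNCell(hidden_size=5, input_size=20)
>>> hx = None
>>> outputs = []
>>> for t in range(inp_seq.size(1)):
...     out, hx = cell(inp_seq[:, t], hx)
...     outputs.append(out)
>>> torch.stack(outputs, dim=1).shape
torch.Size([4, 6, 5])
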
forward(x, hx=None)[source]

Returns the output of the RNNCell.

Parameters
  • x (torch.Tensor) – Input tensor for the current timestep.

  • hx (torch.Tensor) – Hidden state from the previous timestep.

training: bool
class speechbrain.nnet.RNN.GRUCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic GRU Cell for a timestep of input, while GRU() takes the whole sequence as input.

It is designed for an autoregressive decoder (e.g., an attentional decoder), which takes one input at a time. It uses torch.nn.GRUCell() instead of torch.nn.GRU() to reduce VRAM consumption. It accepts input tensors formatted as (batch, fea).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the GRU architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

Example

>>> inp_tensor = torch.rand([4, 20])
>>> net = GRUCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
forward(x, hx=None)[source]

Returns the output of the GRUCell.

Parameters
  • x (torch.Tensor) – Input tensor for the current timestep.

  • hx (torch.Tensor) – Hidden state from the previous timestep.

training: bool
class speechbrain.nnet.RNN.LSTMCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic LSTM Cell for a timestep of input, while LSTM() takes the whole sequence as input.

It is designed for an autoregressive decoder (e.g., an attentional decoder), which takes one input at a time. It uses torch.nn.LSTMCell() instead of torch.nn.LSTM() to reduce VRAM consumption. It accepts input tensors formatted as (batch, fea).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the LSTM architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

Example

>>> inp_tensor = torch.rand([4, 20])
>>> net = LSTMCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
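
The same stepwise pattern applies here; the second return value (for an LSTM cell, typically the pair of hidden and cell states) can be threaded back into the next call without unpacking it. A sketch:

>>> import torch
>>> from speechbrain.nnet.RNN import LSTMCell
>>> dec_steps = torch.rand([4, 3, 20])  # (batch, time, fea)
>>> cell = LSTMCell(hidden_size=5, input_size=20)
>>> hx = None
>>> for t in range(dec_steps.size(1)):
...     out, hx = cell(dec_steps[:, t], hx)  # hx is passed back exactly as returned
>>> out.shape
torch.Size([4, 5])
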
forward(x, hx=None)[source]

Returns the output of the LSTMCell.

Parameters
  • x (torch.Tensor) – Input tensor for the current timestep.

  • hx – Hidden and cell states from the previous timestep.

training: bool
class speechbrain.nnet.RNN.AttentionalRNNDecoder(rnn_type, attn_type, hidden_size, attn_dim, num_layers, enc_dim, input_size, nonlinearity='relu', re_init=True, normalization='batchnorm', scaling=1.0, channels=None, kernel_size=None, bias=True, dropout=0.0)[source]

Bases: torch.nn.modules.module.Module

This function implements an RNN decoder model with attention.

It supports different RNN types for the decoder. It accepts enc_states tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

Parameters
  • rnn_type (str) – Type of recurrent neural network to use (rnn, lstm, gru).

  • attn_type (str) – Type of attention to use (location, content).

  • hidden_size (int) – Number of neurons.

  • attn_dim (int) – Number of attention module internal and output neurons.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • enc_dim (int) – Expected size of the encoder output dimension (i.e., the feature dimension of enc_states).

  • input_size (int) – Expected size of the relevant input dimension.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu). This option is active for rnn and ligru models only. For lstm and gru tanh is used.

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • normalization (str) – Type of normalization for the ligru model (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • scaling (float) – A scaling factor to sharpen or smoothen the attention distribution.

  • channels (int) – Number of channels for location-aware attention.

  • kernel_size (int) – Size of the kernel for location-aware attention.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

Example

>>> enc_states = torch.rand([4, 10, 20])
>>> wav_len = torch.rand([4])
>>> inp_tensor = torch.rand([4, 5, 6])
>>> net = AttentionalRNNDecoder(
...     rnn_type="lstm",
...     attn_type="content",
...     hidden_size=7,
...     attn_dim=5,
...     num_layers=1,
...     enc_dim=20,
...     input_size=6,
... )
>>> out_tensor, attn = net(inp_tensor, enc_states, wav_len)
>>> out_tensor.shape
torch.Size([4, 5, 7])
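
Location-aware attention additionally requires channels and kernel_size. A hedged sketch of such a configuration (the other arguments mirror the content-based example above; the attention internals are not asserted here):

>>> import torch
>>> from speechbrain.nnet.RNN import AttentionalRNNDecoder
>>> enc_states = torch.rand([4, 10, 20])
>>> wav_len = torch.ones([4])
>>> inp_tensor = torch.rand([4, 5, 6])
>>> net = AttentionalRNNDecoder(
...     rnn_type="gru",
...     attn_type="location",
...     hidden_size=7,
...     attn_dim=5,
...     num_layers=1,
...     enc_dim=20,
...     input_size=6,
...     channels=10,
...     kernel_size=3,
... )
>>> out_tensor, attn = net(inp_tensor, enc_states, wav_len)
>>> # Expected: out_tensor of shape (4, 5, 7), as in the content-based example.
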
forward_step(inp, hs, c, enc_states, enc_len)[source]

One step of forward pass process.

Parameters
  • inp (torch.Tensor) – The input of the current timestep.

  • hs (torch.Tensor or tuple of torch.Tensor) – The cell state for the RNN.

  • c (torch.Tensor) – The context vector of the previous timestep.

  • enc_states (torch.Tensor) – The tensor generated by encoder, to be attended.

  • enc_len (torch.LongTensor) – The actual length of encoder states.

Returns

  • dec_out (torch.Tensor) – The output tensor.

  • hs (torch.Tensor or tuple of torch.Tensor) – The new cell state for RNN.

  • c (torch.Tensor) – The context vector of the current timestep.

  • w (torch.Tensor) – The weight of attention.

forward(inp_tensor, enc_states, wav_len)[source]

This method implements the forward pass of the attentional RNN decoder.

Parameters
  • inp_tensor (torch.Tensor) – The input tensor for each timestep of the RNN decoder.

  • enc_states (torch.Tensor) – The tensor to be attended by the decoder.

  • wav_len (torch.Tensor) – This variable stores the relative lengths of the waveforms.

Returns

  • outputs (torch.Tensor) – The output of the RNN decoder.

  • attn (torch.Tensor) – The attention weight of each timestep.

training: bool
class speechbrain.nnet.RNN.LiGRU(hidden_size, input_shape, nonlinearity='relu', normalization='batchnorm', num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This function implements a Light GRU (liGRU).

The liGRU is a single-gate GRU model based on batch normalization + ReLU activations + recurrent dropout. For more info see:

“M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, Light Gated Recurrent Units for Speech Recognition, in IEEE Transactions on Emerging Topics in Computational Intelligence, 2018” (https://arxiv.org/abs/1803.10225)

To speed it up, it is compiled with the torch just-in-time compiler (JIT) right before use.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • normalization (str) – Type of normalization for the ligru model (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LiGRU(input_shape=inp_tensor.shape, hidden_size=5)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
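
The normalization and bidirectional options combine freely; as with the other wrappers in this module, the two directions are expected to be concatenated, giving an output dimension of 2*hidden_size (sketch):

>>> import torch
>>> from speechbrain.nnet.RNN import LiGRU
>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LiGRU(
...     input_shape=inp_tensor.shape,
...     hidden_size=5,
...     num_layers=2,
...     normalization="layernorm",
...     bidirectional=True,
... )
>>> out_tensor, _ = net(inp_tensor)
>>> # Expected: out_tensor.shape == torch.Size([4, 10, 10]) -- directions concatenated.
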
forward(x, hx: Optional[torch.Tensor] = None)[source]

Returns the output of the liGRU.

Parameters
  • x (torch.Tensor) – Input tensor.

  • hx (torch.Tensor) – Initial hidden state.

training: bool
class speechbrain.nnet.RNN.LiGRU_Layer(input_size, hidden_size, num_layers, batch_size, dropout=0.0, nonlinearity='relu', normalization='batchnorm', bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This function implements a Light Gated Recurrent Units (liGRU) layer.

Parameters
  • input_size (int) – Feature dimensionality of the input tensors.

  • batch_size (int) – Batch size of the input tensors.

  • hidden_size (int) – Number of output neurons.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • normalization (str) – Type of normalization (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

forward(x: torch.Tensor, hx: Optional[torch.Tensor] = None) → torch.Tensor[source]

Returns the output of the liGRU layer.

Parameters

x (torch.Tensor) – Input tensor.

training: bool
class speechbrain.nnet.RNN.QuasiRNNLayer(input_size, hidden_size, bidirectional, zoneout=0.0, output_gate=True)[source]

Bases: torch.nn.modules.module.Module

Applies a single layer Quasi-Recurrent Neural Network (QRNN) to an input sequence.

Parameters
  • input_size (int) – The number of expected features in the input x.

  • hidden_size (int) – The number of features in the hidden state h.

  • zoneout (float) – Whether to apply zoneout (i.e. failing to update elements in the hidden state) to the hidden state updates. Default: 0.

  • output_gate (bool) – If True, performs QRNN-fo (applying an output gate to the output). If False, performs QRNN-f. Default: True.

  • bidirectional (bool) – If True, the layer processes the sequence in both directions and concatenates the outputs.

Example

>>> import torch
>>> model = QuasiRNNLayer(60, 256, bidirectional=True)
>>> a = torch.rand([10, 120, 60])
>>> b = model(a)
>>> b[0].shape
torch.Size([10, 120, 512])
training: bool
forgetMult(f: torch.Tensor, x: torch.Tensor, hidden: Optional[torch.Tensor]) → torch.Tensor[source]

Returns the hidden states for each time step.

Parameters

  • f (torch.Tensor) – Forget gate activations.

  • x (torch.Tensor) – Linearly transformed input.

  • hidden (torch.Tensor) – Initial hidden state, if any.
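
For reference, in the pytorch-qrnn implementation this module is adapted from, the forget-mult recurrence is h_t = f_t * x_t + (1 - f_t) * h_(t-1), applied along the time dimension; this is equivalent, up to the sign convention of the gate, to the f-pooling described in the QRNN paper.
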

split_gate_inputs(y: Tensor) → Tuple[Tensor, Tensor, Optional[Tensor]][source]

Splits the linearly transformed input into the gate tensors.

forward(x: Tensor, hidden: Optional[Tensor] = None) → Tuple[Tensor, Tensor][source]

Returns the output of the QRNN layer.

Parameters

x (torch.Tensor) – Input to transform linearly.

class speechbrain.nnet.RNN.QuasiRNN(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, **kwargs)[source]

Bases: torch.nn.modules.module.Module

This is an implementation of the Quasi-RNN.

https://arxiv.org/pdf/1611.01576.pdf

Part of the code is adapted from: https://github.com/salesforce/pytorch-qrnn

Parameters
  • hidden_size (int) – The number of features in the hidden state h.

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – The number of QRNN layers to produce.

  • zoneout (float) – Whether to apply zoneout (i.e., failing to update elements in the hidden state) to the hidden state updates. Default: 0.

  • output_gate (bool) – If True, performs QRNN-fo (applying an output gate to the output). If False, performs QRNN-f. Default: True.

Example

>>> a = torch.rand([8, 120, 40])
>>> model = QuasiRNN(
...     256, num_layers=4, input_shape=a.shape, bidirectional=True
... )
>>> b, _ = model(a)
>>> b.shape
torch.Size([8, 120, 512])
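
zoneout and output_gate are not explicit constructor arguments; they are documented above because the remaining keyword arguments are assumed to be forwarded to each QuasiRNNLayer. A hedged sketch under that assumption:

>>> import torch
>>> from speechbrain.nnet.RNN import QuasiRNN
>>> a = torch.rand([8, 120, 40])
>>> model = QuasiRNN(
...     256,
...     num_layers=4,
...     input_shape=a.shape,
...     bidirectional=True,
...     zoneout=0.1,  # assumption: passed through **kwargs to QuasiRNNLayer
... )
>>> b, _ = model(a)
>>> b.shape
torch.Size([8, 120, 512])
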
training: bool
forward(x, hidden=None)[source]

Returns the output of the QuasiRNN.

speechbrain.nnet.RNN.rnn_init(module)[source]

This function is used to initialize the RNN weights. Orthogonal initialization is used for the recurrent connections.

Parameters

module (torch.nn.Module) – Recurrent neural network module.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor = net(inp_tensor)
>>> rnn_init(net)
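
The same scheme can be written by hand. The sketch below only illustrates the initialization described above (orthogonal recurrent weights, Xavier input weights) on a plain torch.nn.RNN; init_like_rnn_init is a hypothetical helper, not the library's internal code.

>>> import torch
>>> def init_like_rnn_init(module):
...     # Illustration only: orthogonal init for the recurrent (hidden-to-hidden)
...     # weights, Xavier init for the input-to-hidden weights.
...     for name, param in module.named_parameters():
...         if "weight_hh" in name:
...             torch.nn.init.orthogonal_(param)
...         elif "weight_ih" in name:
...             torch.nn.init.xavier_uniform_(param)
...
>>> net = torch.nn.RNN(input_size=20, hidden_size=5, batch_first=True)
>>> init_like_rnn_init(net)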