speechbrain.nnet.RNN module

Library implementing recurrent neural networks.

  • Mirco Ravanelli 2020

  • Ju-Chieh Chou 2020

  • Jianyuan Zhong 2020

  • Loren Lugosch 2020




  • AttentionalRNNDecoder – This class implements an RNN decoder model with attention.

  • GRU – This class implements a basic GRU.

  • GRUCell – This class implements a basic GRU Cell for a timestep of input, while GRU() takes the whole sequence as input.

  • LSTM – This class implements a basic LSTM.

  • LSTMCell – This class implements a basic LSTM Cell for a timestep of input, while LSTM() takes the whole sequence as input.

  • LiGRU – This class implements a Light GRU (liGRU).

  • LiGRU_Layer – This class implements a Light-Gated Recurrent Units (liGRU) layer.

  • QuasiRNN – This is an implementation of the Quasi-RNN.

  • QuasiRNNLayer – Applies a single layer Quasi-Recurrent Neural Network (QRNN) to an input sequence.

  • RNN – This class implements a vanilla RNN.

  • RNNCell – This class implements a basic RNN Cell for a timestep of input, while RNN() takes the whole sequence as input.


  • pack_padded_sequence – Returns packed speechbrain-formatted tensors.

  • pad_packed_sequence – Returns speechbrain-formatted tensor from packed sequences.

  • rnn_init – This function is used to initialize the RNN weights.


speechbrain.nnet.RNN.pack_padded_sequence(inputs, lengths)[source]

Returns packed speechbrain-formatted tensors.


speechbrain.nnet.RNN.pad_packed_sequence(inputs)

Returns speechbrain-formatted tensor from packed sequences.


inputs (torch.nn.utils.rnn.PackedSequence) – An input set of sequences to convert to a tensor.
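The round trip between padded tensors and packed sequences can be sketched with the underlying PyTorch utilities. This is a minimal illustration, not the SpeechBrain implementation: it calls the functions of the same name from torch.nn.utils.rnn directly, and it assumes that lengths are given as fractions of the padded length and must be converted to absolute lengths first.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Batch of 3 padded sequences with 5 time steps and 2 features: (batch, time, fea).
inputs = torch.rand(3, 5, 2)

# Assumption: lengths are relative (fraction of the padded length).
rel_lengths = torch.tensor([1.0, 0.6, 0.4])
abs_lengths = (rel_lengths * inputs.size(1)).round().long()

# Pack, then unpack. enforce_sorted=False lets the batch stay in its original order.
packed = pack_padded_sequence(
    inputs, abs_lengths, batch_first=True, enforce_sorted=False
)
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)

print(unpacked.shape)        # same (batch, time, fea) layout as the input
print(out_lengths.tolist())  # absolute lengths recovered from the packed sequence
```

After unpacking, positions beyond each sequence's length are filled with zeros, so only the valid frames match the original input.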

class speechbrain.nnet.RNN.RNN(hidden_size, input_shape=None, input_size=None, nonlinearity='relu', num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a vanilla RNN.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.


>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
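The 4d flattening rule and the effect of bidirectional can be illustrated with plain PyTorch. This is a sketch under assumptions: it uses torch.nn.RNN directly rather than the SpeechBrain wrapper, and the reshape is written out by hand to show what flattening (batch, time, fea, channel) into (batch, time, fea*channel) means.

```python
import torch

# A 4d input: (batch, time, fea, channel).
x = torch.rand(4, 10, 20, 3)

# Flatten the last two dimensions into one feature axis: (batch, time, fea*channel).
flat = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3])

# Plain PyTorch RNN, used here only for illustration.
rnn = torch.nn.RNN(
    input_size=flat.shape[-1],
    hidden_size=5,
    nonlinearity="relu",
    batch_first=True,
    bidirectional=True,
)
out, h_n = rnn(flat)

print(flat.shape)  # (4, 10, 60)
print(out.shape)   # last dimension is 2 * hidden_size because of bidirectional=True
print(h_n.shape)   # (num_layers * num_directions, batch, hidden_size)
```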
forward(x, hx=None, lengths=None)[source]

Returns the output of the vanilla RNN.

training: bool
class speechbrain.nnet.RNN.LSTM(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic LSTM.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.


>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LSTM(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
forward(x, hx=None, lengths=None)[source]

Returns the output of the LSTM.

training: bool
class speechbrain.nnet.RNN.GRU(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic GRU.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.


>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = GRU(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
forward(x, hx=None, lengths=None)[source]

Returns the output of the GRU.

training: bool
class speechbrain.nnet.RNN.RNNCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, nonlinearity='tanh')[source]

Bases: torch.nn.modules.module.Module

This class implements a basic RNN Cell for a timestep of input, while RNN() takes the whole sequence as input.

It is designed for an autoregressive decoder (e.g., an attentional decoder), which takes one input at a time. It uses torch.nn.RNNCell() instead of torch.nn.RNN() to reduce VRAM consumption.

It accepts input tensors formatted as (batch, fea).

  • hidden_size (int) – Number of output neurons (i.e, the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).


>>> inp_tensor = torch.rand([4, 20])
>>> net = RNNCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
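The relation between a cell (one timestep per call) and the full-sequence module can be sketched with plain PyTorch. This illustration uses torch.nn.RNNCell and torch.nn.RNN directly, not the SpeechBrain wrappers, and copies the weights by hand so that both compute the same recurrence.

```python
import torch

torch.manual_seed(0)

# Full-sequence module and single-step cell with matching sizes.
rnn = torch.nn.RNN(input_size=20, hidden_size=5, batch_first=True)
cell = torch.nn.RNNCell(input_size=20, hidden_size=5)

# Share the parameters so both modules implement the same function.
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

x = torch.rand(4, 10, 20)  # (batch, time, fea)

# Whole sequence in one call.
seq_out, _ = rnn(x)

# Same computation, one timestep at a time, as an autoregressive decoder would do.
h = torch.zeros(4, 5)
steps = []
for t in range(x.shape[1]):
    h = cell(x[:, t], h)  # input is (batch, fea), state is (batch, hidden)
    steps.append(h)
step_out = torch.stack(steps, dim=1)

print(torch.allclose(seq_out, step_out, atol=1e-5))
```

In a decoder, the input at step t usually depends on the output at step t-1, which is why the step-wise form is needed there.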
forward(x, hx=None)[source]

Returns the output of the RNNCell.

training: bool
class speechbrain.nnet.RNN.GRUCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic GRU Cell for a timestep of input, while GRU() takes the whole sequence as input.

It is designed for an autoregressive decoder (e.g., an attentional decoder), which takes one input at a time. It uses torch.nn.GRUCell() instead of torch.nn.GRU() to reduce VRAM consumption. It accepts input tensors formatted as (batch, fea).

  • hidden_size (int) – Number of output neurons (i.e, the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the GRU architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.


>>> inp_tensor = torch.rand([4, 20])
>>> net = GRUCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
forward(x, hx=None)[source]

Returns the output of the GRUCell.

training: bool
class speechbrain.nnet.RNN.LSTMCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic LSTM Cell for a timestep of input, while LSTM() takes the whole sequence as input.

It is designed for an autoregressive decoder (e.g., an attentional decoder), which takes one input at a time. It uses torch.nn.LSTMCell() instead of torch.nn.LSTM() to reduce VRAM consumption. It accepts input tensors formatted as (batch, fea).

  • hidden_size (int) – Number of output neurons (i.e, the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the LSTM architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.


>>> inp_tensor = torch.rand([4, 20])
>>> net = LSTMCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
forward(x, hx=None)[source]

Returns the output of the LSTMCell.

training: bool
class speechbrain.nnet.RNN.AttentionalRNNDecoder(rnn_type, attn_type, hidden_size, attn_dim, num_layers, enc_dim, input_size, nonlinearity='relu', re_init=True, normalization='batchnorm', scaling=1.0, channels=None, kernel_size=None, bias=True, dropout=0.0)[source]

Bases: torch.nn.modules.module.Module

This class implements an RNN decoder model with attention.

It supports different RNN types (see rnn_type). It accepts enc_states tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

  • rnn_type (str) – Type of recurrent neural network to use (rnn, lstm, gru).

  • attn_type (str) – Type of attention to use (location, content).

  • hidden_size (int) – Number of neurons.

  • attn_dim (int) – Number of attention module internal and output neurons.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • enc_dim (int) – Size of the encoder output vectors (the last dimension of enc_states).

  • input_size (int) – Expected size of the relevant input dimension.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu). This option is active for rnn and ligru models only. For lstm and gru tanh is used.

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • normalization (str) – Type of normalization for the ligru model (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • scaling (float) – A scaling factor to sharpen or smoothen the attention distribution.

  • channels (int) – Number of channels for location-aware attention.

  • kernel_size (int) – Size of the kernel for location-aware attention.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).


>>> enc_states = torch.rand([4, 10, 20])
>>> wav_len = torch.rand([4])
>>> inp_tensor = torch.rand([4, 5, 6])
>>> net = AttentionalRNNDecoder(
...     rnn_type="lstm",
...     attn_type="content",
...     hidden_size=7,
...     attn_dim=5,
...     num_layers=1,
...     enc_dim=20,
...     input_size=6,
... )
>>> out_tensor, attn = net(inp_tensor, enc_states, wav_len)
>>> out_tensor.shape
torch.Size([4, 5, 7])
forward_step(inp, hs, c, enc_states, enc_len)[source]

One step of forward pass process.

  • inp (torch.Tensor) – The input of current timestep.

  • hs (torch.Tensor or tuple of torch.Tensor) – The cell state for RNN.

  • c (torch.Tensor) – The context vector of previous timestep.

  • enc_states (torch.Tensor) – The tensor generated by encoder, to be attended.

  • enc_len (torch.LongTensor) – The actual length of encoder states.


  • dec_out (torch.Tensor) – The output tensor.

  • hs (torch.Tensor or tuple of torch.Tensor) – The new cell state for RNN.

  • c (torch.Tensor) – The context vector of the current timestep.

  • w (torch.Tensor) – The weight of attention.
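The inputs and outputs listed above can be related to each other with a simplified decoding step. This is a hedged sketch, not the library's implementation: the helper names attend and decode_step are hypothetical, the attention is a plain dot product rather than the content or location-aware modules, the hidden size is set equal to the encoder dimension so the dot product is defined, and enc_len is assumed to hold absolute lengths.

```python
import torch


def attend(query, enc_states, enc_len, scaling=1.0):
    """Dot-product attention over enc_states, ignoring padded positions."""
    # (batch, time, dim) x (batch, dim, 1) -> (batch, time)
    scores = torch.bmm(enc_states, query.unsqueeze(2)).squeeze(2)
    positions = torch.arange(enc_states.size(1), device=enc_states.device)
    mask = positions.unsqueeze(0) < enc_len.unsqueeze(1)
    scores = scores.masked_fill(~mask, float("-inf"))
    w = torch.softmax(scaling * scores, dim=-1)            # attention weights
    c = torch.bmm(w.unsqueeze(1), enc_states).squeeze(1)   # context vector
    return c, w


def decode_step(cell, inp, hs, c, enc_states, enc_len):
    """One step: update the state from [inp, previous context], then attend."""
    hs = cell(torch.cat([inp, c], dim=-1), hs)
    c, w = attend(hs, enc_states, enc_len)
    dec_out = torch.cat([c, hs], dim=-1)
    return dec_out, hs, c, w


batch, time, enc_dim, inp_dim = 4, 10, 20, 6
enc_states = torch.rand(batch, time, enc_dim)
enc_len = torch.tensor([10, 8, 5, 3])  # assumed absolute lengths
cell = torch.nn.GRUCell(inp_dim + enc_dim, enc_dim)

hs = torch.zeros(batch, enc_dim)
c = torch.zeros(batch, enc_dim)
dec_out, hs, c, w = decode_step(
    cell, torch.rand(batch, inp_dim), hs, c, enc_states, enc_len
)
print(dec_out.shape, w.shape)
```

Each row of w sums to one, and positions beyond enc_len receive zero weight.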

forward(inp_tensor, enc_states, wav_len)[source]

This method implements the forward pass of the attentional RNN decoder.

  • inp_tensor (torch.Tensor) – The input tensor for each timestep of the RNN decoder.

  • enc_states (torch.Tensor) – The tensor to be attended by the decoder.

  • wav_len (torch.Tensor) – This variable stores the relative length of each waveform.


  • outputs (torch.Tensor) – The output of the RNN decoder.

  • attn (torch.Tensor) – The attention weight of each timestep.

training: bool
class speechbrain.nnet.RNN.LiGRU(hidden_size, input_shape, nonlinearity='relu', normalization='batchnorm', num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a Light GRU (liGRU).

LiGRU is a single-gate GRU model based on batch normalization, ReLU activations, and recurrent dropout. For more info see:

“M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, Light Gated Recurrent Units for Speech Recognition, in IEEE Transactions on Emerging Topics in Computational Intelligence, 2018” (https://arxiv.org/abs/1803.10225)
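The single-gate recurrence can be sketched as follows. This is a simplified illustration based on the update equations in the cited paper, not the SpeechBrain code: the class name TinyLiGRU is hypothetical, and batch normalization and recurrent dropout are left out to keep the recurrence visible.

```python
import torch


class TinyLiGRU(torch.nn.Module):
    """Single-gate recurrence: h_t = z_t * h_{t-1} + (1 - z_t) * relu(candidate)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Feed-forward and recurrent projections for the gate and the candidate.
        self.w = torch.nn.Linear(input_size, 2 * hidden_size, bias=False)
        self.u = torch.nn.Linear(hidden_size, 2 * hidden_size, bias=False)

    def forward(self, x):
        # x: (batch, time, fea). The input projection is computed for all steps at once.
        wx = self.w(x)
        h = x.new_zeros(x.shape[0], self.hidden_size)
        outputs = []
        for t in range(x.shape[1]):
            gates = wx[:, t] + self.u(h)
            z_pre, cand_pre = gates.chunk(2, dim=-1)
            z = torch.sigmoid(z_pre)      # update gate (the only gate)
            cand = torch.relu(cand_pre)   # ReLU candidate state
            h = z * h + (1 - z) * cand
            outputs.append(h)
        return torch.stack(outputs, dim=1)


net = TinyLiGRU(input_size=20, hidden_size=5)
out = net(torch.rand(4, 10, 20))
print(out.shape)
```

Compared with a standard GRU, there is no reset gate and the candidate uses ReLU instead of tanh.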

This is a custom RNN. To speed it up, it must be compiled with the torch just-in-time compiler (jit) right before using it. You can compile it with: compiled_model = torch.jit.script(model)

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • normalization (str) – Type of normalization for the ligru model (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.


>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LiGRU(input_shape=inp_tensor.shape, hidden_size=5)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
forward(x, hx: Optional[torch.Tensor] = None)[source]

Returns the output of the liGRU.

training: bool
class speechbrain.nnet.RNN.LiGRU_Layer(input_size, hidden_size, num_layers, batch_size, dropout=0.0, nonlinearity='relu', normalization='batchnorm', bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a Light-Gated Recurrent Units (liGRU) layer.

  • input_size (int) – Feature dimensionality of the input tensors.

  • batch_size (int) – Batch size of the input tensors.

  • hidden_size (int) – Number of output neurons.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • normalization (str) – Type of normalization (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

forward(x: torch.Tensor, hx: Optional[torch.Tensor] = None) → torch.Tensor[source]

Returns the output of the liGRU layer.


x (torch.Tensor) – Input tensor.

training: bool
class speechbrain.nnet.RNN.QuasiRNNLayer(input_size, hidden_size, bidirectional, zoneout=0.0, output_gate=True)[source]

Bases: torch.nn.modules.module.Module

Applies a single layer Quasi-Recurrent Neural Network (QRNN) to an input sequence.

  • input_size (int) – The number of expected features in the input x.

  • hidden_size (int) – The number of features in the hidden state h. If not specified, the input size is used.

  • zoneout (float) – Whether to apply zoneout (i.e. failing to update elements in the hidden state) to the hidden state updates. Default: 0.

  • output_gate (bool) – If True, performs QRNN-fo (applying an output gate to the output). If False, performs QRNN-f. Default: True.


>>> import torch
>>> model = QuasiRNNLayer(60, 256, bidirectional=True)
>>> a = torch.rand([10, 120, 60])
>>> b = model(a)
>>> b[0].shape
torch.Size([10, 120, 512])
training: bool
forgetMult(f: torch.Tensor, x: torch.Tensor, hidden: Optional[torch.Tensor]) → torch.Tensor[source]

Returns the hidden states for each time step.


wx (torch.Tensor) – Linearly transformed input.
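The recurrence behind forgetMult can be sketched as below. This is an assumption-laden illustration rather than the library's code: it follows the ForgetMult convention of the pytorch-qrnn repository, h_t = f_t * x_t + (1 - f_t) * h_{t-1}, assumes a time-first layout (time, batch, hidden), and uses the hypothetical name forget_mult.

```python
import torch


def forget_mult(f, x, hidden=None):
    """Return h_t = f_t * x_t + (1 - f_t) * h_{t-1} for every time step."""
    outputs = []
    h_prev = hidden
    for t in range(x.shape[0]):
        h_t = f[t] * x[t]
        if h_prev is not None:
            # Without an initial state, the first step reduces to f_0 * x_0.
            h_t = h_t + (1 - f[t]) * h_prev
        outputs.append(h_t)
        h_prev = h_t
    return torch.stack(outputs)


# Constant gate of 0.5 on an all-ones input: the state moves halfway to 1 each step.
f = torch.full((3, 1, 2), 0.5)
x = torch.ones(3, 1, 2)
h = forget_mult(f, x)
print(h[:, 0, 0].tolist())  # [0.5, 0.75, 0.875]
```

Only this element-wise loop is sequential. The gates themselves are computed for all time steps at once, which is what makes the QRNN cheaper than a fully recurrent layer.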

split_gate_inputs(y: Tensor) → Tuple[Tensor, Tensor, Optional[Tensor]][source]
forward(x: Tensor, hidden: Optional[Tensor] = None) → Tuple[Tensor, Tensor][source]

Returns the output of the QRNN layer.


x (torch.Tensor) – Input to transform linearly.

class speechbrain.nnet.RNN.QuasiRNN(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, **kwargs)[source]

Bases: torch.nn.modules.module.Module

This is an implementation of the Quasi-RNN.


Part of the code is adapted from: https://github.com/salesforce/pytorch-qrnn

  • hidden_size (int) – The number of features in the hidden state h. If not specified, the input size is used.

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – The number of QRNN layers to produce.

  • zoneout (float) – Whether to apply zoneout (i.e. failing to update elements in the hidden state) to the hidden state updates. Default: 0.

  • output_gate (bool) – If True, performs QRNN-fo (applying an output gate to the output). If False, performs QRNN-f. Default: True.


>>> a = torch.rand([8, 120, 40])
>>> model = QuasiRNN(
...     256, num_layers=4, input_shape=a.shape, bidirectional=True
... )
>>> b, _ = model(a)
>>> b.shape
torch.Size([8, 120, 512])
training: bool
forward(x, hidden=None)[source]

speechbrain.nnet.RNN.rnn_init(module)

This function is used to initialize the RNN weights. Recurrent connections use orthogonal initialization.


module (torch.nn.Module) – Recurrent neural network module.


>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor = net(inp_tensor)
>>> rnn_init(net)
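Orthogonal initialization of the recurrent connections can be sketched with plain PyTorch. This is an illustration under assumptions, not the SpeechBrain function: the helper name orthogonal_recurrent_init is hypothetical, and it selects recurrent weights by the "weight_hh" naming used by torch.nn.RNN, torch.nn.LSTM, and torch.nn.GRU.

```python
import torch


def orthogonal_recurrent_init(module):
    """Apply orthogonal initialization to every hidden-to-hidden weight matrix."""
    for name, param in module.named_parameters():
        if "weight_hh" in name:
            torch.nn.init.orthogonal_(param)


net = torch.nn.RNN(input_size=20, hidden_size=5, batch_first=True)
orthogonal_recurrent_init(net)

# For a square recurrent matrix, orthogonality means W @ W.T is the identity.
w = net.weight_hh_l0.detach()
print(torch.allclose(w @ w.T, torch.eye(5), atol=1e-5))
```

An orthogonal recurrent matrix preserves the norm of the hidden state under repeated multiplication, which helps against vanishing and exploding gradients early in training.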