speechbrain.nnet.RNN module

Library implementing recurrent neural networks.

Authors
  • Mirco Ravanelli 2020

  • Ju-Chieh Chou 2020

  • Jianyuan Zhong 2020

  • Loren Lugosch 2020

Summary

Classes:

AttentionalRNNDecoder

This class implements an RNN decoder model with attention.

GRU

This class implements a basic GRU.

GRUCell

This class implements a basic GRU Cell for a timestep of input, while GRU() takes the whole sequence as input.

LSTM

This class implements a basic LSTM.

LSTMCell

This class implements a basic LSTM Cell for a timestep of input, while LSTM() takes the whole sequence as input.

LiGRU

This class implements a Light GRU (LiGRU).

LiGRU_Layer

This class implements a Light Gated Recurrent Unit (LiGRU) layer.

QuasiRNN

This is an implementation of the Quasi-RNN.

QuasiRNNLayer

Applies a single layer Quasi-Recurrent Neural Network (QRNN) to an input sequence.

RNN

This class implements a vanilla RNN.

RNNCell

This class implements a basic RNN Cell for a timestep of input, while RNN() takes the whole sequence as input.

Functions:

pack_padded_sequence

Returns packed speechbrain-formatted tensors.

pad_packed_sequence

Returns speechbrain-formatted tensor from packed sequences.

rnn_init

This function is used to initialize the RNN weights.

Reference

speechbrain.nnet.RNN.pack_padded_sequence(inputs, lengths)[source]

Returns packed speechbrain-formatted tensors.

Parameters
speechbrain.nnet.RNN.pad_packed_sequence(inputs)[source]

Returns speechbrain-formatted tensor from packed sequences.

Parameters

inputs (torch.nn.utils.rnn.PackedSequence) – An input set of sequences to convert to a tensor.
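
As a rough illustration of what packing and unpacking do, the sketch below calls the underlying torch.nn.utils.rnn functions directly rather than the SpeechBrain wrappers. It assumes that lengths are given as relative values in (0, 1] and are converted to absolute frame counts before packing; that conversion is an assumption about the wrapper's behavior, not something stated above.

```python
import torch
from torch.nn.utils import rnn as rnn_utils

# Batch of 3 padded sequences, 5 time steps, 2 features: (batch, time, fea).
inputs = torch.arange(30, dtype=torch.float32).reshape(3, 5, 2)

# Assumed convention: relative lengths, where 1.0 is the longest sequence.
rel_lengths = torch.tensor([1.0, 0.6, 0.4])
abs_lengths = torch.round(rel_lengths * inputs.shape[1]).long()

# Pack: padded frames are dropped so a recurrent layer would skip them.
packed = rnn_utils.pack_padded_sequence(
    inputs, abs_lengths, batch_first=True, enforce_sorted=False
)

# Unpack: back to a padded (batch, time, fea) tensor with zeros in the padding.
unpacked, out_lengths = rnn_utils.pad_packed_sequence(packed, batch_first=True)
```

The round trip keeps the valid frames unchanged and zero-fills everything past each sequence's length.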

class speechbrain.nnet.RNN.RNN(hidden_size, input_shape=None, input_size=None, nonlinearity='relu', num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a vanilla RNN.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
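
The flattening of 4d inputs described above can be sketched with plain PyTorch. This is only an illustration: a torch.nn.RNN stands in for the wrapped layer, and the 4d input shape is made up for the example.

```python
import torch

# Hypothetical 4d input: (batch, time, fea, channel).
x = torch.rand(4, 10, 20, 3)

# Merge the last two axes so the RNN sees (batch, time, fea * channel).
x_flat = x.reshape(x.shape[0], x.shape[1], -1)

# A plain PyTorch RNN stands in for the wrapped layer here.
rnn = torch.nn.RNN(
    input_size=x_flat.shape[-1], hidden_size=5, nonlinearity="relu", batch_first=True
)
out, h_n = rnn(x_flat)
```

Here x_flat has 20 * 3 = 60 features per frame, and the output keeps the (batch, time) axes with hidden_size as the last dimension.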
forward(x, hx=None, lengths=None)[source]

Returns the output of the vanilla RNN.

Parameters
training: bool
class speechbrain.nnet.RNN.LSTM(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic LSTM.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LSTM(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
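
For an LSTM, the hidden state passed as hx is a pair of tensors rather than a single tensor. The sketch below shows this with torch.nn.LSTM directly; it assumes the wrapper hands hx through to the underlying PyTorch layer unchanged, which is not stated above.

```python
import torch

inp = torch.rand(4, 10, 20)  # (batch, time, fea)
lstm = torch.nn.LSTM(input_size=20, hidden_size=5, num_layers=2, batch_first=True)

# The second return value is a tuple: final hidden state and final cell state,
# each shaped (num_layers, batch, hidden_size).
out, (h_n, c_n) = lstm(inp)

# Passing the states back in continues the recurrence on a new chunk of input.
out_next, _ = lstm(torch.rand(4, 3, 20), (h_n, c_n))
```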
forward(x, hx=None, lengths=None)[source]

Returns the output of the LSTM.

Parameters
training: bool
class speechbrain.nnet.RNN.GRU(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic GRU.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel) the tensor is flattened as (batch, time, fea*channel).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = GRU(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
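
The effect of the bidirectional option on output shapes can be sketched with torch.nn.GRU directly. This assumes the wrapper concatenates the two directions along the feature axis in the same way PyTorch does.

```python
import torch

inp = torch.rand(4, 10, 20)  # (batch, time, fea)
uni = torch.nn.GRU(input_size=20, hidden_size=5, batch_first=True)
bi = torch.nn.GRU(input_size=20, hidden_size=5, batch_first=True, bidirectional=True)

out_uni, h_uni = uni(inp)
out_bi, h_bi = bi(inp)

# The last axis holds the left-to-right states first, then the right-to-left ones.
fwd, bwd = out_bi[..., :5], out_bi[..., 5:]
```

The forward direction finishes at the last frame and the backward direction finishes at the first frame, so fwd[:, -1] and bwd[:, 0] match the two final hidden states in h_bi.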
forward(x, hx=None, lengths=None)[source]

Returns the output of the GRU.

Parameters
training: bool
class speechbrain.nnet.RNN.RNNCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True, nonlinearity='tanh')[source]

Bases: torch.nn.modules.module.Module

This class implements a basic RNN Cell for a timestep of input, while RNN() takes the whole sequence as input.

It is designed for autoregressive decoders (e.g., an attentional decoder), which take one input at a time. It uses torch.nn.RNNCell() instead of torch.nn.RNN() to reduce VRAM consumption.

It accepts input tensors formatted as (batch, fea).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

Example

>>> inp_tensor = torch.rand([4, 20])
>>> net = RNNCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
forward(x, hx=None)[source]

Returns the output of the RNNCell.

Parameters
training: bool
class speechbrain.nnet.RNN.GRUCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic GRU Cell for a timestep of input, while GRU() takes the whole sequence as input.

It is designed for autoregressive decoders (e.g., an attentional decoder), which take one input at a time. It uses torch.nn.GRUCell() instead of torch.nn.GRU() to reduce VRAM consumption. It accepts input tensors formatted as (batch, fea).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the GRU architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

Example

>>> inp_tensor = torch.rand([4, 20])
>>> net = GRUCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
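
The relationship between a cell (one timestep per call) and the full-sequence layer can be sketched with the PyTorch building blocks. The example below is an illustration with torch.nn.GRUCell and torch.nn.GRU, not the SpeechBrain classes themselves: the cell is stepped through time by hand and compared with the layer after copying its weights.

```python
import torch

torch.manual_seed(0)
batch, time, fea, hidden = 4, 10, 20, 5
inp = torch.rand(batch, time, fea)

gru = torch.nn.GRU(input_size=fea, hidden_size=hidden, batch_first=True)
cell = torch.nn.GRUCell(input_size=fea, hidden_size=hidden)

# Copy the layer's weights into the cell so both compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

# Step through time one frame at a time, carrying the hidden state along.
h = torch.zeros(batch, hidden)
steps = []
with torch.no_grad():
    for t in range(time):
        h = cell(inp[:, t], h)  # input is (batch, fea): a single timestep
        steps.append(h)
    out_cell = torch.stack(steps, dim=1)  # (batch, time, hidden)
    out_full, _ = gru(inp)
```

In an autoregressive decoder the input at step t depends on the output at step t-1, so only the step-by-step form is usable there.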
forward(x, hx=None)[source]

Returns the output of the GRUCell.

Parameters
training: bool
class speechbrain.nnet.RNN.LSTMCell(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, dropout=0.0, re_init=True)[source]

Bases: torch.nn.modules.module.Module

This class implements a basic LSTM Cell for a timestep of input, while LSTM() takes the whole sequence as input.

It is designed for autoregressive decoders (e.g., an attentional decoder), which take one input at a time. It uses torch.nn.LSTMCell() instead of torch.nn.LSTM() to reduce VRAM consumption. It accepts input tensors formatted as (batch, fea).

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – Number of layers to employ in the LSTM architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

Example

>>> inp_tensor = torch.rand([4, 20])
>>> net = LSTMCell(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 5])
forward(x, hx=None)[source]

Returns the output of the LSTMCell.

Parameters
training: bool
class speechbrain.nnet.RNN.AttentionalRNNDecoder(rnn_type, attn_type, hidden_size, attn_dim, num_layers, enc_dim, input_size, nonlinearity='relu', re_init=True, normalization='batchnorm', scaling=1.0, channels=None, kernel_size=None, bias=True, dropout=0.0)[source]

Bases: torch.nn.modules.module.Module

This class implements an RNN decoder model with attention.

It supports different RNN models. The enc_states tensors are expected in the format (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened to (batch, time, fea*channel).

Parameters
  • rnn_type (str) – Type of recurrent neural network to use (rnn, lstm, gru).

  • attn_type (str) – Type of attention to use (location, content).

  • hidden_size (int) – Number of neurons.

  • attn_dim (int) – Number of attention module internal and output neurons.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • input_shape (tuple) – Expected shape of an input.

  • input_size (int) – Expected size of the relevant input dimension.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu). This option is active for rnn and ligru models only. For lstm and gru tanh is used.

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • normalization (str) – Type of normalization for the ligru model (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • scaling (float) – A scaling factor to sharpen or smoothen the attention distribution.

  • channels (int) – Number of channels for location-aware attention.

  • kernel_size (int) – Size of the kernel for location-aware attention.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).
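
The role of the scaling parameter can be illustrated with a small softmax sketch. It assumes that the scaling factor multiplies the attention scores before the softmax; the parameter description above does not say exactly where the factor is applied, so treat this as one plausible reading.

```python
import math


def softmax(scores, scaling=1.0):
    """Softmax over a list of scores, each multiplied by `scaling` first."""
    scaled = [scaling * s for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]       # hypothetical attention energies
smooth = softmax(scores, 0.5)  # scaling < 1 flattens the distribution
plain = softmax(scores, 1.0)
sharp = softmax(scores, 2.0)   # scaling > 1 concentrates mass on the peak
```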

Example

>>> enc_states = torch.rand([4, 10, 20])
>>> wav_len = torch.rand([4])
>>> inp_tensor = torch.rand([4, 5, 6])
>>> net = AttentionalRNNDecoder(
...     rnn_type="lstm",
...     attn_type="content",
...     hidden_size=7,
...     attn_dim=5,
...     num_layers=1,
...     enc_dim=20,
...     input_size=6,
... )
>>> out_tensor, attn = net(inp_tensor, enc_states, wav_len)
>>> out_tensor.shape
torch.Size([4, 5, 7])
forward_step(inp, hs, c, enc_states, enc_len)[source]

One step of forward pass process.

Parameters
  • inp (torch.Tensor) – The input of current timestep.

  • hs (torch.Tensor or tuple of torch.Tensor) – The cell state for RNN.

  • c (torch.Tensor) – The context vector of previous timestep.

  • enc_states (torch.Tensor) – The tensor generated by encoder, to be attended.

  • enc_len (torch.LongTensor) – The actual length of encoder states.

Returns

  • dec_out (torch.Tensor) – The output tensor.

  • hs (torch.Tensor or tuple of torch.Tensor) – The new cell state for RNN.

  • c (torch.Tensor) – The context vector of the current timestep.

  • w (torch.Tensor) – The weight of attention.

forward(inp_tensor, enc_states, wav_len)[source]

This method implements the forward pass of the attentional RNN decoder.

Parameters
  • inp_tensor (torch.Tensor) – The input tensor for each timestep of the RNN decoder.

  • enc_states (torch.Tensor) – The tensor to be attended by the decoder.

  • wav_len (torch.Tensor) – The relative length of each waveform.

Returns

  • outputs (torch.Tensor) – The output of the RNN decoder.

  • attn (torch.Tensor) – The attention weight of each timestep.
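
Since wav_len holds relative lengths while forward_step expects the actual length of the encoder states, some conversion has to happen in between. A minimal sketch of one way to do it, assuming relative lengths in (0, 1] are scaled by the number of encoder frames and rounded; the helper names are hypothetical and not part of the module.

```python
def relative_to_absolute(wav_len, max_time):
    """Convert relative lengths in (0, 1] to frame counts (assumed convention)."""
    return [int(round(r * max_time)) for r in wav_len]


def length_mask(abs_len, max_time):
    """True for real frames, False for padding, one row per utterance."""
    return [[t < n for t in range(max_time)] for n in abs_len]


wav_len = [1.0, 0.5, 0.8]  # hypothetical relative lengths
enc_len = relative_to_absolute(wav_len, max_time=10)
mask = length_mask(enc_len, max_time=10)
```

A mask like this is what lets attention ignore padded encoder frames.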

training: bool
class speechbrain.nnet.RNN.LiGRU(hidden_size, input_shape, nonlinearity='relu', normalization='batchnorm', num_layers=1, bias=True, dropout=0.0, re_init=True, bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a Light GRU (LiGRU).

LiGRU is a single-gate GRU model based on batch normalization, ReLU activations, and recurrent dropout. For more information, see:

“M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, Light Gated Recurrent Units for Speech Recognition, in IEEE Transactions on Emerging Topics in Computational Intelligence, 2018” (https://arxiv.org/abs/1803.10225)

To speed it up, it is compiled with the torch just-in-time compiler (jit) right before using it.

It accepts input tensors formatted as (batch, time, fea). In the case of 4d inputs like (batch, time, fea, channel), the tensor is flattened as (batch, time, fea*channel).
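
The single-gate update can be sketched as follows. This is a simplified illustration based on the equations in the cited paper: it omits normalization and recurrent dropout, and the function name, weight shapes, and weight values are made up for the example.

```python
import torch


def ligru_step(x_t, h_prev, w, u):
    """One Light GRU update without normalization or dropout (a simplification).

    x_t: (batch, fea), h_prev: (batch, hidden),
    w: (fea, 2 * hidden), u: (hidden, 2 * hidden).
    """
    gates = x_t @ w + h_prev @ u
    a_t, z_t = gates.chunk(2, dim=1)  # candidate and update-gate pre-activations
    z_t = torch.sigmoid(z_t)          # the single (update) gate
    h_cand = torch.relu(a_t)          # ReLU candidate state, no reset gate
    return z_t * h_prev + (1 - z_t) * h_cand


torch.manual_seed(0)
x = torch.rand(4, 10, 20)      # (batch, time, fea)
w = torch.randn(20, 10) * 0.1  # hypothetical weights for hidden = 5
u = torch.randn(5, 10) * 0.1
h = torch.zeros(4, 5)
outputs = []
for t in range(x.shape[1]):
    h = ligru_step(x[:, t], h, w, u)
    outputs.append(h)
out = torch.stack(outputs, dim=1)  # (batch, time, hidden)
```

Because the candidate passes through a ReLU and the gate only interpolates, the state stays non-negative when it starts at zero.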

Parameters
  • hidden_size (int) – Number of output neurons (i.e., the dimensionality of the output).

  • input_shape (tuple) – The shape of an example input.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • normalization (str) – Type of normalization for the ligru model (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • bias (bool) – If True, the additive bias b is adopted.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • re_init (bool) – If True, orthogonal initialization is used for the recurrent weights. Xavier initialization is used for the input connection weights.

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = LiGRU(input_shape=inp_tensor.shape, hidden_size=5)
>>> out_tensor, _ = net(inp_tensor)
>>> out_tensor.shape
torch.Size([4, 10, 5])
forward(x, hx: Optional[torch.Tensor] = None)[source]

Returns the output of the liGRU.

Parameters
training: bool
class speechbrain.nnet.RNN.LiGRU_Layer(input_size, hidden_size, num_layers, batch_size, dropout=0.0, nonlinearity='relu', normalization='batchnorm', bidirectional=False)[source]

Bases: torch.nn.modules.module.Module

This class implements a Light Gated Recurrent Unit (LiGRU) layer.

Parameters
  • input_size (int) – Feature dimensionality of the input tensors.

  • batch_size (int) – Batch size of the input tensors.

  • hidden_size (int) – Number of output neurons.

  • num_layers (int) – Number of layers to employ in the RNN architecture.

  • nonlinearity (str) – Type of nonlinearity (tanh, relu).

  • normalization (str) – Type of normalization (batchnorm, layernorm). Every string different from batchnorm and layernorm will result in no normalization.

  • dropout (float) – It is the dropout factor (must be between 0 and 1).

  • bidirectional (bool) – If True, a bidirectional model that scans the sequence both right-to-left and left-to-right is used.

forward(x: torch.Tensor, hx: Optional[torch.Tensor] = None) → torch.Tensor[source]

Returns the output of the liGRU layer.

Parameters

x (torch.Tensor) – Input tensor.

training: bool
class speechbrain.nnet.RNN.QuasiRNNLayer(input_size, hidden_size, bidirectional, zoneout=0.0, output_gate=True)[source]

Bases: torch.nn.modules.module.Module

Applies a single layer Quasi-Recurrent Neural Network (QRNN) to an input sequence.

Parameters
  • input_size (int) – The number of expected features in the input x.

  • hidden_size (int) – The number of features in the hidden state h. If not specified, the input size is used.

  • zoneout (float) – Whether to apply zoneout (i.e. failing to update elements in the hidden state) to the hidden state updates. Default: 0.

  • output_gate (bool) – If True, performs QRNN-fo (applying an output gate to the output). If False, performs QRNN-f. Default: True.

Example

>>> import torch
>>> model = QuasiRNNLayer(60, 256, bidirectional=True)
>>> a = torch.rand([10, 120, 60])
>>> b = model(a)
>>> b[0].shape
torch.Size([10, 120, 512])
training: bool
forgetMult(f: torch.Tensor, x: torch.Tensor, hidden: Optional[torch.Tensor]) → torch.Tensor[source]

Returns the hidden states for each time step.

Parameters

wx (torch.Tensor) – Linearly transformed input.
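
The recurrence behind a forget-mult operation can be sketched as below. It assumes time-first tensors and the convention h_t = f_t * x_t + (1 - f_t) * h_{t-1}, as used in the pytorch-qrnn code this module is adapted from; some formulations swap the roles of f and 1 - f. The function name is hypothetical.

```python
from typing import Optional

import torch


def forget_mult_sketch(
    f: torch.Tensor, x: torch.Tensor, hidden: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Compute h_t = f_t * x_t + (1 - f_t) * h_{t-1} over (time, batch, hidden)."""
    h_prev = hidden
    states = []
    for t in range(x.shape[0]):
        h_t = f[t] * x[t]
        if h_prev is not None:
            h_t = h_t + (1 - f[t]) * h_prev
        states.append(h_t)
        h_prev = h_t
    return torch.stack(states)  # one hidden state per time step


x = torch.rand(6, 2, 3)  # (time, batch, hidden)
h0 = torch.rand(2, 3)
all_new = forget_mult_sketch(torch.ones_like(x), x)        # gate fully open
all_old = forget_mult_sketch(torch.zeros_like(x), x, h0)   # gate fully closed
half = forget_mult_sketch(torch.full_like(x, 0.5), x)
```

With the gate fully open the states copy the input; with it fully closed the initial state is carried through unchanged.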

split_gate_inputs(y: Tensor) → Tuple[Tensor, Tensor, Optional[Tensor]][source]
forward(x: Tensor, hidden: Optional[Tensor] = None) → Tuple[Tensor, Tensor][source]

Returns the output of the QRNN layer.

Parameters

x (torch.Tensor) – Input to transform linearly.

class speechbrain.nnet.RNN.QuasiRNN(hidden_size, input_shape=None, input_size=None, num_layers=1, bias=True, batch_first=False, dropout=0, bidirectional=False, **kwargs)[source]

Bases: torch.nn.modules.module.Module

This is an implementation of the Quasi-RNN.

https://arxiv.org/pdf/1611.01576.pdf

Part of the code is adapted from: https://github.com/salesforce/pytorch-qrnn

Parameters
  • hidden_size (int) – The number of features in the hidden state h. If not specified, the input size is used.

  • input_shape (tuple) – The shape of an example input. Alternatively, use input_size.

  • input_size (int) – The size of the input. Alternatively, use input_shape.

  • num_layers (int) – The number of QRNN layers to produce.

  • zoneout (float) – Whether to apply zoneout (i.e. failing to update elements in the hidden state) to the hidden state updates. Default: 0.

  • output_gate (bool) – If True, performs QRNN-fo (applying an output gate to the output). If False, performs QRNN-f. Default: True.

Example

>>> a = torch.rand([8, 120, 40])
>>> model = QuasiRNN(
...     256, num_layers=4, input_shape=a.shape, bidirectional=True
... )
>>> b, _ = model(a)
>>> b.shape
torch.Size([8, 120, 512])
training: bool
forward(x, hidden=None)[source]
speechbrain.nnet.RNN.rnn_init(module)[source]

This function is used to initialize the RNN weights. Recurrent connections use orthogonal initialization.

Parameters

module (torch.nn.Module) – Recurrent neural network module.

Example

>>> inp_tensor = torch.rand([4, 10, 20])
>>> net = RNN(hidden_size=5, input_shape=inp_tensor.shape)
>>> out_tensor = net(inp_tensor)
>>> rnn_init(net)
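
The initialization scheme mentioned in the re_init descriptions (orthogonal for recurrent weights, Xavier for input weights) can be sketched with torch.nn.init. The helper below is hypothetical and is not the body of rnn_init; it assumes PyTorch's weight_hh / weight_ih parameter naming and uses Xavier uniform as one possible Xavier variant.

```python
import torch


def init_rnn_weights_sketch(module: torch.nn.Module) -> None:
    """Orthogonal init for recurrent weights, Xavier for input weights (illustrative)."""
    for name, param in module.named_parameters():
        if "weight_hh" in name:
            torch.nn.init.orthogonal_(param)
        elif "weight_ih" in name:
            torch.nn.init.xavier_uniform_(param)


rnn = torch.nn.RNN(input_size=20, hidden_size=5, batch_first=True)
init_rnn_weights_sketch(rnn)
w_hh = rnn.weight_hh_l0.detach()  # square (hidden, hidden) recurrent matrix
```

After the call, the recurrent matrix times its transpose is close to the identity, which is the property orthogonal initialization is meant to provide.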