This lobe replicates the encoder first introduced in ESPnet v1.
Authors: Titouan Parcollet 2020
- class speechbrain.lobes.models.ESPnetVGG.ESPnetVGG(input_shape, activation=<class 'torch.nn.modules.activation.ReLU'>, dropout=0.15, cnn_channels=[64, 128], rnn_class=<class 'speechbrain.nnet.RNN.LSTM'>, rnn_layers=4, rnn_neurons=512, rnn_bidirectional=True, rnn_re_init=False, projection_neurons=512)
- This model is a combination of CNNs and RNNs following
the ESPnet encoder (VGG + RNN + MLP + tanh()).
input_shape (tuple) – The shape of an example expected input.
activation (torch class) – A class used to construct the activation layers of the CNN and DNN parts.
dropout (float) – Neuron dropout rate, applied to RNN only.
cnn_channels (list of ints) – A list of the number of output channels for each CNN block.
rnn_class (torch class) – The type of RNN to use (LiGRU, LSTM, GRU, RNN).
rnn_layers (int) – The number of recurrent layers to include.
rnn_neurons (int) – Number of neurons in each layer of the RNN.
rnn_bidirectional (bool) – Whether the RNN processes the sequence in both directions (True) or forward only (False).
projection_neurons (int) – The number of neurons in the last linear layer.
>>> inputs = torch.rand([10, 40, 60])
>>> model = ESPnetVGG(input_shape=inputs.shape)
>>> outputs = model(inputs)
>>> outputs.shape
torch.Size([10, 10, 512])
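To make the VGG + RNN + MLP + tanh() pattern concrete, the following is a minimal PyTorch sketch of the same structure, not SpeechBrain's actual implementation. The class name TinyESPnetVGG is hypothetical; hyperparameters mirror the defaults in the signature above (two CNN blocks of 64 and 128 channels, a 4-layer bidirectional LSTM with 512 neurons, a 512-neuron projection). It also shows why the example output is torch.Size([10, 10, 512]): two 2x poolings shrink the 40-step time axis to 10.

```python
import torch
import torch.nn as nn

class TinyESPnetVGG(nn.Module):
    """Hedged sketch of the VGG+RNN+MLP+tanh() encoder pattern."""

    def __init__(self, feat_dim=60, cnn_channels=(64, 128),
                 rnn_neurons=512, rnn_layers=4, projection_neurons=512):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in cnn_channels:
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),  # halves both the time and feature axes
            ]
            in_ch = out_ch
        self.vgg = nn.Sequential(*blocks)
        # After two 2x poolings the feature axis shrinks by a factor of 4.
        rnn_input_size = cnn_channels[-1] * (feat_dim // 4)
        self.rnn = nn.LSTM(rnn_input_size, rnn_neurons, num_layers=rnn_layers,
                           bidirectional=True, batch_first=True, dropout=0.15)
        # Bidirectional RNN doubles the feature size before the projection.
        self.proj = nn.Linear(2 * rnn_neurons, projection_neurons)

    def forward(self, x):                  # x: (batch, time, features)
        x = x.unsqueeze(1)                 # add a channel dim: (B, 1, T, F)
        x = self.vgg(x)                    # (B, C, T/4, F/4)
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)  # back to (B, T, feats)
        x, _ = self.rnn(x)
        return torch.tanh(self.proj(x))    # final MLP + tanh()

model = TinyESPnetVGG()
out = model(torch.rand(10, 40, 60))        # (batch=10, time=40, features=60)
```

With the [10, 40, 60] input from the doctest, `out.shape` is torch.Size([10, 10, 512]), matching the example output of the real lobe.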