speechbrain.decoders.transducer module

Decoders and output normalization for Transducer sequence.

Author:

Abdelwahab HEBA 2020
Sung-Lin Yeh 2020

Summary

Classes:

TransducerBeamSearcher

This class implements the beam-search algorithm for the transducer model.

Reference

class speechbrain.decoders.transducer.TransducerBeamSearcher(decode_network_lst, tjoint, classifier_network, blank_id, beam_size=4, nbest=5, lm_module=None, lm_weight=0.0, state_beam=2.3, expand_beam=2.3)

Bases: torch.nn.modules.module.Module

This class implements the beam-search algorithm for the transducer model.

Parameters
  • decode_network_lst (list) – List of prediction network (PN) layers.

  • tjoint (transducer_joint module) – This module performs the joint between TN and PN.

  • classifier_network (list) – List of output layers applied after the joint between TN and PN, e.g. (TN, PN) => joint => classifier_network_list [DNN block, Linear, ...] => chars prob

  • blank_id (int) – The blank symbol/index.

  • beam_size (int) – The width of the beam. Greedy search is used when beam_size = 1.

  • nbest (int) – Number of hypotheses to keep.

  • lm_module (torch.nn.ModuleList) – Neural network modules for the LM.

  • lm_weight (float) – The LM weight λ used during beam search; hypotheses are scored as log P(y|x) + λ log P_LM(y). (default: 0.0)

  • state_beam (float) – The threshold coefficient, in log space, used to decide whether hyps in A (process_hyps) are likely to compete with hyps in B (beam_hyps); if not, the while loop ends. Reference: https://arxiv.org/pdf/1911.01629.pdf

  • expand_beam (float) – The threshold coefficient that limits the number of expanded hypotheses added to A (process_hyps). References: https://arxiv.org/pdf/1911.01629.pdf and https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/simple-decoder.cc (see PruneToks)
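
The LM-weighted score and the two pruning thresholds described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not SpeechBrain's implementation: the helper names fused_score, expandable_tokens and should_stop are hypothetical, and the toy probabilities are made up.

```python
import math
from typing import List, Sequence


def fused_score(logp_transducer: float, logp_lm: float, lm_weight: float) -> float:
    """Combine scores as log P(y|x) + lm_weight * log P_LM(y)."""
    return logp_transducer + lm_weight * logp_lm


def expandable_tokens(
    log_probs: Sequence[float], blank_id: int, expand_beam: float
) -> List[int]:
    """Keep non-blank tokens within expand_beam (log space) of the best one."""
    best = max(lp for i, lp in enumerate(log_probs) if i != blank_id)
    return [
        i
        for i, lp in enumerate(log_probs)
        if i != blank_id and lp >= best - expand_beam
    ]


def should_stop(best_logp_a: float, best_logp_b: float, state_beam: float) -> bool:
    """Stop expanding A once the best hyp in B leads the best in A by state_beam."""
    return best_logp_b >= best_logp_a + state_beam


# Toy distribution over 4 symbols (assumed values); index 0 is blank.
log_probs = [math.log(p) for p in (0.5, 0.4, 0.09, 0.01)]
kept = expandable_tokens(log_probs, blank_id=0, expand_beam=2.3)
```

Since exp(2.3) is roughly 10, the default thresholds prune candidates that are about ten times less probable than the current best; here token 3 is dropped while tokens 1 and 2 are kept.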

Example

searcher = TransducerBeamSearcher(
    decode_network_lst=[hparams["emb"], hparams["dec"]],
    tjoint=hparams["Tjoint"],
    classifier_network=[hparams["transducer_lin"]],
    blank_id=0,
    beam_size=hparams["beam_size"],
    nbest=hparams["nbest"],
    lm_module=hparams["lm_model"],
    lm_weight=hparams["lm_weight"],
    state_beam=2.3,
    expand_beam=2.3,
)

>>> import torch
>>> import speechbrain as sb
>>> from speechbrain.decoders.transducer import TransducerBeamSearcher
>>> from speechbrain.nnet.transducer.transducer_joint import Transducer_joint
>>> emb = sb.nnet.embedding.Embedding(
...     num_embeddings=35,
...     embedding_dim=3,
...     consider_as_one_hot=True,
...     blank_id=0
... )
>>> dec = sb.nnet.RNN.GRU(
...     hidden_size=10, input_shape=(1, 40, 34), bidirectional=False
... )
>>> lin = sb.nnet.linear.Linear(input_shape=(1, 40, 10), n_neurons=35)
>>> joint_network = sb.nnet.linear.Linear(input_shape=(1, 1, 40, 35), n_neurons=35)
>>> tjoint = Transducer_joint(joint_network, joint="sum")
>>> searcher = TransducerBeamSearcher(
...     decode_network_lst=[emb, dec],
...     tjoint=tjoint,
...     classifier_network=[lin],
...     blank_id=0,
...     beam_size=1,
...     nbest=1,
...     lm_module=None,
...     lm_weight=0.0,
... )
>>> enc = torch.rand([1, 20, 10])
>>> hyps, scores, _, _ = searcher(enc)

forward(tn_output)
Parameters

tn_output (torch.tensor) – Output from transcription network with shape [batch, time_len, hiddens].

Return type

Topk hypotheses

transducer_greedy_decode(tn_output)
Transducer greedy decoder: a greedy decoder over the batch that applies the Transducer rules:
1- For each time step in the Transcription Network (TN) output:
   -> Update the ith utterance only if the previous target != the new one (we save the hiddens and the target).
   -> Otherwise, keep the previous target prediction from the decoder.
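
The greedy rule can be sketched for a single utterance as follows. This is a minimal sketch, not the library code: it assumes the common reading that each TN frame either emits one non-blank token (the PN output and state are updated) or emits blank (they are kept). The callables pn_step and joint, and the toy stand-ins toy_pn_step and toy_joint, are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def greedy_transducer_decode(
    tn_frames: Sequence,
    pn_step: Callable,
    joint: Callable,
    blank_id: int = 0,
) -> Tuple[List[int], float]:
    """Greedy decoding of one utterance, at most one token per TN frame."""
    hyp: List[int] = []
    score = 0.0
    # Prime the prediction network (PN) with the blank symbol.
    pn_out, state = pn_step(blank_id, None)
    for frame in tn_frames:
        log_probs = joint(frame, pn_out)
        best = max(range(len(log_probs)), key=lambda i: log_probs[i])
        score += log_probs[best]
        if best != blank_id:
            # Non-blank: emit the token and save the new PN output and state.
            hyp.append(best)
            pn_out, state = pn_step(best, state)
        # Blank: keep the previous PN output and state unchanged.
    return hyp, score


# Toy stand-ins (hypothetical): a frame is the id of the symbol it favours,
# and the PN output is simply the last emitted token.
def toy_pn_step(token, state):
    return token, state


def toy_joint(frame, pn_out):
    probs = [0.1, 0.1, 0.1]
    probs[frame] = 0.8
    return [math.log(p) for p in probs]


hyp, score = greedy_transducer_decode([1, 0, 2, 0], toy_pn_step, toy_joint)
```

With the toy frames [1, 0, 2, 0], the two blank-favouring frames leave the hypothesis untouched and the other two each append one token.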

Parameters

tn_output (torch.tensor) – Output from transcription network with shape [batch, time_len, hiddens].

Returns

Outputs a logits tensor [B,T,1,Output_Dim]; padding has not been removed.

Return type

torch.tensor

transducer_beam_search_decode(tn_output)
Transducer beam search decoder: a beam search decoder over the batch that applies the Transducer rules:
1- For each utterance:
2- For each time step in the Transcription Network (TN) output:
   -> Do a forward pass on the PN and the joint network.
   -> Select topK <= beam.
   -> Do a while loop extending the hyps until we reach blank.
   -> Otherwise, extend the hyp by the new token.
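
The loop structure above can be sketched for one utterance as follows. This is a simplified sketch, not SpeechBrain's implementation: it omits the LM, the state_beam and expand_beam pruning, and any merging of identical prefixes, and it always moves the blank extension of an expanded hypothesis into B so that the while loop terminates. The callables pn_step and joint and the toy stand-ins are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def beam_search_transducer(
    tn_frames: Sequence,
    pn_step: Callable,
    joint: Callable,
    blank_id: int = 0,
    beam_size: int = 2,
) -> List[Dict]:
    """Simplified beam search for one utterance; returns hyps, best first."""
    pn_out, state = pn_step(blank_id, None)
    beam_hyps = [{"prediction": [], "logp": 0.0, "pn_out": pn_out, "state": state}]
    for frame in tn_frames:
        process_hyps = beam_hyps  # A: hypotheses still to expand at this frame
        beam_hyps = []            # B: hypotheses that emitted blank at this frame
        while len(beam_hyps) < beam_size:
            best = max(process_hyps, key=lambda h: h["logp"])
            process_hyps.remove(best)
            log_probs = joint(frame, best["pn_out"])
            # Blank ends the expansion of this hypothesis for the current frame.
            beam_hyps.append({**best, "logp": best["logp"] + log_probs[blank_id]})
            # Extend with the top-k non-blank tokens (k = beam_size).
            tokens = [i for i in range(len(log_probs)) if i != blank_id]
            tokens.sort(key=lambda i: log_probs[i], reverse=True)
            for token in tokens[:beam_size]:
                new_out, new_state = pn_step(token, best["state"])
                process_hyps.append({
                    "prediction": best["prediction"] + [token],
                    "logp": best["logp"] + log_probs[token],
                    "pn_out": new_out,
                    "state": new_state,
                })
    return sorted(beam_hyps, key=lambda h: h["logp"], reverse=True)


# Toy stand-ins (hypothetical): the PN output is the last emitted token, and the
# joint favours blank on silent frames (0) or once the frame's symbol was emitted.
def toy_pn_step(token, state):
    return token, state


def toy_joint(frame, pn_out):
    target = 0 if frame == 0 or pn_out == frame else frame
    probs = [0.1, 0.1, 0.1]
    probs[target] = 0.8
    return [math.log(p) for p in probs]


nbest = beam_search_transducer([1, 0, 2], toy_pn_step, toy_joint)
```

Unlike the greedy case, a hypothesis may emit several tokens at the same frame here, because non-blank extensions go back into A and can be expanded again before the frame is closed by a blank.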

Parameters

tn_output (torch.tensor) – Output from transcription network with shape [batch, time_len, hiddens].

Returns

Outputs a logits tensor [B,T,1,Output_Dim]; padding has not been removed.

Return type

torch.tensor

training: bool