speechbrain.decoders.transducer module

Decoders and output normalization for Transducer sequence.

Authors: Abdelwahab HEBA 2020, Sung-Lin Yeh 2020

Summary

Classes:

TransducerBeamSearcher

This class implements the beam-search algorithm for the transducer model.

Reference

class speechbrain.decoders.transducer.TransducerBeamSearcher(decode_network_lst, tjoint, classifier_network, blank_id, beam_size=4, nbest=5, lm_module=None, lm_weight=0.0, state_beam=2.3, expand_beam=2.3)[source]

Bases: Module

This class implements the beam-search algorithm for the transducer model.

Parameters

decode_network_lst (list) – List of prediction network (PN) layers.
tjoint (transducer_joint module) – This module performs the joint between the TN and the PN.
classifier_network (list) – List of output layers applied after the joint between the TN and the PN, e.g. (TN, PN) => joint => classifier_network_list [DNN block, Linear, ...] => chars prob
blank_id (int) – The blank symbol/index.
beam_size (int) – The width of the beam. Greedy search is used when beam_size = 1.
nbest (int) – Number of hypotheses to keep.
lm_module (torch.nn.ModuleList) – Neural networks modules for LM.
lm_weight (float) – The weight λ of the LM when performing beam search. Hypotheses are scored as log P(y|x) + λ log P_LM(y). (default: 0.0)
state_beam (float) – The threshold coefficient in log space used to decide whether the hyps in A (process_hyps) are likely to compete with the hyps in B (beam_hyps); if not, the while loop ends. Reference: https://arxiv.org/pdf/1911.01629.pdf
expand_beam (float) – The threshold coefficient that limits the number of expanded hypotheses added to A (process_hyps). References: https://arxiv.org/pdf/1911.01629.pdf and https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/simple-decoder.cc (see PruneToks)
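
The roles of lm_weight, state_beam, and expand_beam can be illustrated with a minimal sketch in plain Python. The helper names (fused_log_prob, expand_candidates, should_stop) are hypothetical and are not part of the SpeechBrain API; the comparisons are an assumption about how such log-space thresholds are commonly applied, not a copy of the library code.

```python
def fused_log_prob(log_p_am, log_p_lm, lm_weight):
    """Shallow fusion score: log P(y|x) + lambda * log P_LM(y)."""
    return log_p_am + lm_weight * log_p_lm


def expand_candidates(log_probs, expand_beam):
    """Keep only tokens whose log-prob lies within expand_beam of the best token.

    log_probs: dict mapping token -> log-probability (hypothetical layout).
    """
    best = max(log_probs.values())
    return {tok: lp for tok, lp in log_probs.items() if lp >= best - expand_beam}


def should_stop(best_a_score, best_b_score, state_beam):
    """Return True when the best hyp in B leads the best hyp in A by at least
    state_beam (in log space), i.e. A is unlikely to compete with B anymore."""
    return best_b_score >= state_beam + best_a_score


# With lm_weight = 0.0 the LM term has no effect on the score.
print(fused_log_prob(-1.0, -2.0, 0.0))  # -1.0
# Token "c" is more than 2.3 below the best token and is pruned.
print(sorted(expand_candidates({"a": -0.1, "b": -1.0, "c": -5.0}, 2.3)))  # ['a', 'b']
```

A larger expand_beam or state_beam keeps more hypotheses alive and makes the search slower but less likely to prune the correct path.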

Example

searcher = TransducerBeamSearcher(
    decode_network_lst=[hparams["emb"], hparams["dec"]],
    tjoint=hparams["Tjoint"],
    classifier_network=[hparams["transducer_lin"]],
    blank_id=0,
    beam_size=hparams["beam_size"],
    nbest=hparams["nbest"],
    lm_module=hparams["lm_model"],
    lm_weight=hparams["lm_weight"],
    state_beam=2.3,
    expand_beam=2.3,
)

>>> import torch
>>> import speechbrain as sb
>>> from speechbrain.decoders.transducer import TransducerBeamSearcher
>>> from speechbrain.nnet.transducer.transducer_joint import Transducer_joint
>>> emb = sb.nnet.embedding.Embedding(
...     num_embeddings=35,
...     embedding_dim=3,
...     consider_as_one_hot=True,
...     blank_id=0
... )
>>> dec = sb.nnet.RNN.GRU(
...     hidden_size=10, input_shape=(1, 40, 34), bidirectional=False
... )
>>> lin = sb.nnet.linear.Linear(input_shape=(1, 40, 10), n_neurons=35)
>>> joint_network = sb.nnet.linear.Linear(input_shape=(1, 1, 40, 35), n_neurons=35)
>>> tjoint = Transducer_joint(joint_network, joint="sum")
>>> searcher = TransducerBeamSearcher(
...     decode_network_lst=[emb, dec],
...     tjoint=tjoint,
...     classifier_network=[lin],
...     blank_id=0,
...     beam_size=1,
...     nbest=1,
...     lm_module=None,
...     lm_weight=0.0,
... )
>>> enc = torch.rand([1, 20, 10])
>>> hyps, scores, _, _ = searcher(enc)

forward(tn_output)[source]

Parameters: tn_output (torch.tensor) – Output from transcription network with shape [batch, time_len, hiddens].
Return type: Topk hypotheses

transducer_greedy_decode(tn_output)[source]

The transducer greedy decoder is a greedy decoder over a batch that applies the Transducer rules:

1- For each time step in the Transcription Network (TN) output:

-> Update the ith utterance only if the previous target != the new one (the hiddens and the target are saved).

-> Otherwise, keep the previous target prediction from the decoder.

Parameters: tn_output (torch.tensor) – Output from transcription network with shape [batch, time_len, hiddens].
Returns: Outputs a logits tensor [B,T,1,Output_Dim]; padding has not been removed.
Return type: torch.tensor
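
The greedy rule above can be sketched for a single utterance in plain Python. This is a toy illustration under stated assumptions, not the SpeechBrain implementation: greedy_transducer_decode and toy_joint are hypothetical names, and toy_joint stands in for the prediction network, joint, and classifier by returning log-probabilities directly.

```python
import math


def toy_joint(frame, last_token, vocab_size=4):
    """Hypothetical stand-in for PN + joint + classifier.

    Favors the token stored in `frame`, or blank (0) once that token
    has already been emitted as the last target.
    """
    target = 0 if frame == last_token else frame
    return [math.log(0.7) if k == target else math.log(0.1) for k in range(vocab_size)]


def greedy_transducer_decode(frames, joint_log_probs, blank_id=0):
    """Emit at most one token per frame; update the state only on non-blank."""
    hyp, score = [], 0.0
    last_token = blank_id  # the prediction network starts from blank
    for frame in frames:
        log_probs = joint_log_probs(frame, last_token)
        best = max(range(len(log_probs)), key=log_probs.__getitem__)
        score += log_probs[best]
        if best != blank_id:
            # New target: save it (and, in a real model, the PN hidden state).
            hyp.append(best)
            last_token = best
        # Blank: keep the previous target and state unchanged.
    return hyp, score


hyp, score = greedy_transducer_decode([1, 1, 2, 0, 3], toy_joint)
print(hyp)  # [1, 2, 3]
```

Because the state changes only when a non-blank token is predicted, repeated frames carrying the same token collapse into a single output symbol.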

transducer_beam_search_decode(tn_output)[source]

The transducer beam search decoder is a beam search decoder over a batch that applies the Transducer rules:

1- For each utterance:

2- For each time step in the Transcription Network (TN) output:

-> Do a forward pass on the PN and the joint network.

-> Select topK <= beam.

-> Do a while loop extending the hyps until we reach blank.

-> Otherwise, extend the hyp by the new token.

Parameters: tn_output (torch.tensor) – Output from transcription network with shape [batch, time_len, hiddens].
Returns: Outputs a logits tensor [B,T,1,Output_Dim]; padding has not been removed.
Return type: torch.tensor
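
The per-frame loop described above can be sketched for one utterance as follows. This is a simplified toy under stated assumptions, not the library code: beam_search_transducer and toy_joint are hypothetical names, no language model or state_beam/expand_beam pruning is applied, and duplicate hypotheses are not merged.

```python
import math


def toy_joint(frame, last_token, vocab_size=4):
    """Hypothetical stand-in for PN + joint + classifier (returns log-probs)."""
    target = 0 if frame == last_token else frame
    return [math.log(0.7) if k == target else math.log(0.1) for k in range(vocab_size)]


def beam_search_transducer(frames, joint_log_probs, blank_id=0, beam_size=2):
    """Return hypotheses as (tokens, log_score) pairs, best first."""
    beam_hyps = [((), 0.0)]
    for frame in frames:
        process_hyps = beam_hyps  # A: hyps that may still be extended at this frame
        beam_hyps = []            # B: hyps that emitted blank at this frame
        while process_hyps and len(beam_hyps) < beam_size:
            # Pop the most probable hypothesis from A.
            i = max(range(len(process_hyps)), key=lambda j: process_hyps[j][1])
            tokens, score = process_hyps.pop(i)
            last = tokens[-1] if tokens else blank_id
            log_probs = joint_log_probs(frame, last)
            # Blank: the hyp is done with this frame and moves to B.
            beam_hyps.append((tokens, score + log_probs[blank_id]))
            # Otherwise: extend the hyp with the top-k non-blank tokens, back into A.
            ranked = sorted(
                (k for k in range(len(log_probs)) if k != blank_id),
                key=lambda k: log_probs[k],
                reverse=True,
            )
            for k in ranked[:beam_size]:
                process_hyps.append((tokens + (k,), score + log_probs[k]))
    return sorted(beam_hyps, key=lambda h: h[1], reverse=True)


results = beam_search_transducer([1, 1, 2], toy_joint, beam_size=2)
print(results[0][0])  # (1, 2)
```

With beam_size=1 only one hypothesis survives each frame, although this sketch still allows several tokens per frame, unlike the one-token-per-frame greedy rule.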

training: bool