speechbrain.decoders.transducer module

Decoders and output normalization for Transducer sequence.

Authors: Abdelwahab HEBA 2020, Sung-Lin Yeh 2020

Summary

Classes:

`TransducerBeamSearcher`	This class implements the beam-search algorithm for the transducer model.
`TransducerGreedySearcherStreamingContext`	Simple wrapper for the hidden state of the transducer greedy searcher.

Functions:

`get_transducer_key`	Argument function to customize the sort order (in sorted & max).

Reference

class speechbrain.decoders.transducer.TransducerGreedySearcherStreamingContext(hidden: Any | None = None)[source]

Bases: Module

Simple wrapper for the hidden state of the transducer greedy searcher. Used by transducer_greedy_decode_streaming().

hidden: Any | None = None: Hidden state; typically a tensor or a tuple of tensors.

class speechbrain.decoders.transducer.TransducerBeamSearcher(decode_network_lst, tjoint, classifier_network, blank_id, beam_size=4, nbest=5, lm_module=None, lm_weight=0.0, state_beam=2.3, expand_beam=2.3)[source]

Bases: Module

This class implements the beam-search algorithm for the transducer model.

Parameters:

decode_network_lst (list) – List of prediction network (PN) layers.
tjoint (transducer_joint module) – This module performs the joint between TN and PN.
classifier_network (list) – List of output layers applied after the joint between TN and PN, e.g. (TN, PN) => joint => classifier_network [DNN block, Linear, ...] => character probabilities.
blank_id (int) – The blank symbol/index.
beam_size (int) – The width of beam. Greedy Search is used when beam_size = 1.
nbest (int) – Number of hypotheses to keep.
lm_module (torch.nn.ModuleList) – Neural networks modules for LM.
lm_weight (float) – The weight of the LM (λ) when performing beam search; hypotheses are scored as log P(y|x) + λ log P_LM(y). (default: 0.0)
state_beam (float) – The threshold coefficient in log space used to decide whether hyps in A (process_hyps) are likely to compete with hyps in B (beam_hyps); if not, the while loop ends. Reference: https://arxiv.org/pdf/1911.01629.pdf
expand_beam (float) – The threshold coefficient that limits the number of expanded hypotheses added to A (process_hyps). References: https://arxiv.org/pdf/1911.01629.pdf and https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/simple-decoder.cc (see PruneToks).
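
The roles of lm_weight and expand_beam can be illustrated with a minimal pure-Python sketch. The helper names `fused_score` and `expand_candidates` are hypothetical and not part of SpeechBrain, and the pruning rule shown (keep non-blank tokens within expand_beam of the best non-blank log-probability) is an assumption based on the parameter descriptions above, not a copy of the library code.

```python
import math


def fused_score(logp_transducer, logp_lm, lm_weight=0.0):
    """Return log P(y|x) + lm_weight * log P_LM(y) (hypothetical helper)."""
    return logp_transducer + lm_weight * logp_lm


def expand_candidates(token_logps, blank_id, expand_beam=2.3):
    """Keep non-blank tokens whose log-prob is within expand_beam of the best one.

    token_logps maps token id -> log-probability and must contain at least
    one non-blank token. This pruning rule is an assumption for illustration.
    """
    non_blank = {tok: lp for tok, lp in token_logps.items() if tok != blank_id}
    best = max(non_blank.values())
    return sorted(tok for tok, lp in non_blank.items() if lp >= best - expand_beam)


# Toy distribution over four tokens; token 0 plays the role of blank.
token_logps = {
    0: math.log(0.5),
    1: math.log(0.3),
    2: math.log(0.19),
    3: math.log(0.01),
}
kept = expand_candidates(token_logps, blank_id=0, expand_beam=2.3)
print(kept)  # tokens 1 and 2 survive; token 3 is too far below the best
```

With lm_weight=0.0 the fused score reduces to the transducer score alone, which matches the constructor default.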

Example

searcher = TransducerBeamSearcher(
    decode_network_lst=[hparams["emb"], hparams["dec"]],
    tjoint=hparams["Tjoint"],
    classifier_network=[hparams["transducer_lin"]],
    blank_id=0,
    beam_size=hparams["beam_size"],
    nbest=hparams["nbest"],
    lm_module=hparams["lm_model"],
    lm_weight=hparams["lm_weight"],
    state_beam=2.3,
    expand_beam=2.3,
)

>>> import torch
>>> import speechbrain as sb
>>> from speechbrain.decoders.transducer import TransducerBeamSearcher
>>> from speechbrain.nnet.transducer.transducer_joint import (
...     Transducer_joint,
... )
>>> emb = sb.nnet.embedding.Embedding(
...     num_embeddings=35,
...     embedding_dim=3,
...     consider_as_one_hot=True,
...     blank_id=0,
... )
>>> dec = sb.nnet.RNN.GRU(
...     hidden_size=10, input_shape=(1, 40, 34), bidirectional=False
... )
>>> lin = sb.nnet.linear.Linear(input_shape=(1, 40, 10), n_neurons=35)
>>> joint_network = sb.nnet.linear.Linear(
...     input_shape=(1, 1, 40, 35), n_neurons=35
... )
>>> tjoint = Transducer_joint(joint_network, joint="sum")
>>> searcher = TransducerBeamSearcher(
...     decode_network_lst=[emb, dec],
...     tjoint=tjoint,
...     classifier_network=[lin],
...     blank_id=0,
...     beam_size=1,
...     nbest=1,
...     lm_module=None,
...     lm_weight=0.0,
... )
>>> enc = torch.rand([1, 20, 10])
>>> hyps, _, _, _ = searcher(enc)

forward(tn_output)[source]

Parameters:

tn_output (torch.Tensor) – Output from transcription network with shape [batch, time_len, hiddens].

Return type:

Top-k hypotheses

transducer_greedy_decode(tn_output, hidden_state=None, return_hidden=False, max_symbols_per_step=5)[source]

Transducer greedy decoder is a greedy decoder over a batch that applies the Transducer rules:

1- for each time step in the Transcription Network (TN) output:

-> Update the ith utterance only if the previous target != the new one (we save the hiddens and the target)

-> otherwise: --> keep the previous target prediction from the decoder

Parameters:

tn_output (torch.Tensor) – Output from transcription network with shape [batch, time_len, hiddens].
hidden_state ((torch.Tensor, torch.Tensor)) – Hidden state to initially feed the decode network with. This is useful in conjunction with return_hidden to be able to perform beam search in a streaming context, so that you can reuse the last hidden state as an initial state across calls.
return_hidden (bool) – Whether the return tuple should contain an extra 5th element with the hidden state at the last step. See hidden_state.
max_symbols_per_step (int) – Maximum number of non-blank symbols to decode per time step. This is useful to avoid infinite loops.

Returns:

Tuple of 4 or 5 elements (if return_hidden).
First element (List[List[int]]) – List of decoded tokens
Second element (torch.Tensor) – Outputs a logits tensor [B,T,1,Output_Dim]; padding has not been removed.
Third element (None) – nbest; irrelevant for greedy decode
Fourth element (None) – nbest scores; irrelevant for greedy decode
Fifth element (Present if return_hidden, (torch.Tensor, torch.Tensor)) – Tuple representing the hidden state required to call transducer_greedy_decode where you left off in a streaming context.
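
The greedy rule and the role of max_symbols_per_step can be sketched for a single utterance in plain Python. This is a toy illustration of the general greedy transducer loop, not SpeechBrain's batched implementation; `greedy_transducer_decode`, `step_fn`, and `toy_step` are hypothetical names, and `step_fn` stands in for the prediction network, joint, and classifier together.

```python
def greedy_transducer_decode(num_frames, step_fn, blank_id=0, max_symbols_per_step=5):
    """Toy greedy transducer decoding for one utterance.

    step_fn(t, last_token) returns the most likely token id for frame t
    given the last emitted token (a hypothetical stand-in for PN + joint).
    """
    hyp = []
    last_token = blank_id
    for t in range(num_frames):
        emitted = 0
        # Cap non-blank emissions per frame so a model that never
        # predicts blank cannot loop forever.
        while emitted < max_symbols_per_step:
            token = step_fn(t, last_token)
            if token == blank_id:
                break  # blank: advance to the next frame, keep decoder state
            hyp.append(token)
            last_token = token  # non-blank: update the decoder state
            emitted += 1
    return hyp


def toy_step(t, last_token):
    # Frame 0 emits 3; frame 1 emits nothing; frame 2 emits 4 then 5.
    table = {(0, 0): 3, (2, 3): 4, (2, 4): 5}
    return table.get((t, last_token), 0)


decoded = greedy_transducer_decode(num_frames=3, step_fn=toy_step)
print(decoded)

# A degenerate model that never emits blank is bounded by the cap.
capped = greedy_transducer_decode(
    num_frames=2, step_fn=lambda t, last: 7, max_symbols_per_step=3
)
print(len(capped))
```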

transducer_greedy_decode_streaming(x: Tensor, context: TransducerGreedySearcherStreamingContext)[source]

Tiny wrapper for transducer_greedy_decode() with an API that makes it suitable to be passed as a decoding_function for streaming.

Parameters:

x (torch.Tensor) – Outputs of the transcription network (equivalent to tn_output)
context (TransducerGreedySearcherStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by initializing a default object.

Returns:

hyp

Return type:

torch.Tensor
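
The mutable-context pattern described above can be sketched without any SpeechBrain dependency. `StreamingContext`, `decode_chunk`, and `toy_decode` are hypothetical stand-ins used only to show how a context object carries hidden state from one chunk to the next; the real wrapper's internals may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StreamingContext:
    """Hypothetical stand-in for TransducerGreedySearcherStreamingContext."""

    hidden: Optional[Any] = None


def decode_chunk(chunk, context, decode_fn):
    """Decode one chunk, reading and then updating context.hidden in place."""
    hyp, new_hidden = decode_fn(chunk, context.hidden)
    context.hidden = new_hidden
    return hyp


def toy_decode(chunk, hidden):
    # Toy decoder: the "hidden state" is just the number of frames seen so far.
    offset = 0 if hidden is None else hidden
    hyp = [offset + i for i in range(len(chunk))]
    return hyp, offset + len(chunk)


context = StreamingContext()  # default object serves as the initial context
chunks = (["a", "b"], ["c"], ["d", "e"])
outputs = [decode_chunk(c, context, toy_decode) for c in chunks]
print(outputs, context.hidden)
```

Reusing the same context across calls is what lets each chunk continue where the previous one stopped; creating a fresh context resets the decoder.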

transducer_beam_search_decode(tn_output)[source]

Transducer beam search decoder is a beam search decoder over a batch that applies the Transducer rules:

1- for each utterance:

2- for each time step in the Transcription Network (TN) output:

-> Do forward on PN and Joint network

-> Select topK <= beam

-> Do a while loop extending the hyps until we reach blank

-> otherwise: --> extend hyp by the new token

Parameters:

tn_output (torch.Tensor) – Output from transcription network with shape [batch, time_len, hiddens].

Returns:

Outputs a logits tensor [B,T,1,Output_Dim]; padding has not been removed.

Return type:

torch.Tensor
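
The inner while loop and the state_beam threshold can be sketched as a stopping test over the two hypothesis sets A (process_hyps) and B (beam_hyps). The function `should_stop_expanding`, the `logp_score` field name, and the exact direction of the comparison are assumptions made for illustration, following the state_beam description rather than the library source.

```python
def should_stop_expanding(process_hyps, beam_hyps, beam_size, state_beam=2.3):
    """Decide whether to stop extending hypotheses at the current frame.

    process_hyps (A) are still being extended; beam_hyps (B) have already
    emitted blank for this frame. Each hypothesis is a dict with a
    "logp_score" entry (assumed field name).
    """
    if len(beam_hyps) >= beam_size:
        return True  # B is full
    if not process_hyps:
        return True  # nothing left to extend
    if not beam_hyps:
        return False  # nothing in B to compare against yet
    best_a = max(h["logp_score"] for h in process_hyps)
    best_b = max(h["logp_score"] for h in beam_hyps)
    # Stop when even the best hypothesis in A trails the best in B by
    # at least state_beam, so A is unlikely to compete.
    return best_b >= best_a + state_beam


far_behind = should_stop_expanding(
    [{"logp_score": -5.0}], [{"logp_score": -1.0}], beam_size=4
)
still_close = should_stop_expanding(
    [{"logp_score": -2.0}], [{"logp_score": -1.0}], beam_size=4
)
print(far_behind, still_close)
```

A larger state_beam keeps the loop running longer, trading speed for a more thorough search.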

speechbrain.decoders.transducer.get_transducer_key(x)[source]

Argument function to customize the sort order (in sorted & max). To be used as key=partial(get_transducer_key).

Parameters:

x (dict) – one of the items under comparison

Returns:

Normalized log-score.

Return type:

float
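
A key function of this kind can be sketched as follows. The dictionary fields `logp_score` and `prediction`, and normalization by prediction length, are assumptions made for illustration; `length_normalized_key` is a hypothetical stand-in, not the library function itself.

```python
from functools import partial


def length_normalized_key(hyp):
    """Return the hypothesis log-score divided by its length (assumed fields)."""
    return hyp["logp_score"] / len(hyp["prediction"])


hyps = [
    {"prediction": [0, 7], "logp_score": -2.0},        # -1.0 per token
    {"prediction": [0, 7, 9, 4], "logp_score": -3.0},  # -0.75 per token
]

# Same calling convention as in the description above.
best = max(hyps, key=partial(length_normalized_key))
ranked = sorted(hyps, key=length_normalized_key, reverse=True)
print(best["prediction"])
```

Normalizing by length keeps longer hypotheses from being penalized just for accumulating more log-probability terms.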