speechbrain.decoders.scorer module

Token scorer abstraction and specifications.

Authors:
  • Adel Moumen 2022, 2023

  • Sung-Lin Yeh 2021

Summary

Classes:

BaseRescorerInterface

A rescorer abstraction to be inherited by other rescoring approaches that re-rank the hypotheses produced by beam search.

BaseScorerInterface

A scorer abstraction to be inherited by other scoring approaches for beam search.

CTCScorer

A wrapper of CTCPrefixScore based on the BaseScorerInterface.

CoverageScorer

A coverage penalty scorer to prevent looping of hyps, where `coverage` is the cumulative attention probability vector. Reference: https://arxiv.org/pdf/1612.02695.pdf, https://arxiv.org/pdf/1808.10792.pdf.

HuggingFaceLMRescorer

A wrapper of HuggingFace's TransformerLM based on the BaseRescorerInterface.

KenLMScorer

KenLM N-gram scorer.

LengthScorer

A length rewarding scorer.

RNNLMRescorer

A wrapper of RNNLM based on the BaseRescorerInterface.

RNNLMScorer

A wrapper of RNNLM based on BaseScorerInterface.

RescorerBuilder

Builds a rescorer instance for beam search.

ScorerBuilder

Builds a scorer instance for beam search.

TransformerLMRescorer

A wrapper of TransformerLM based on the BaseRescorerInterface.

TransformerLMScorer

A wrapper of TransformerLM based on BaseScorerInterface.

Reference

class speechbrain.decoders.scorer.BaseScorerInterface[source]

Bases: object

A scorer abstraction to be inherited by other scoring approaches for beam search.

A scorer is a module that scores tokens in the vocabulary based on the current timestep input and the previous scorer states. It can score the full vocabulary set (i.e., full scorers) or a pruned set of tokens (i.e., partial scorers) to reduce computation overhead. In the latter case, the partial scorers are called after the full scorers and only score the top-k candidates (i.e., the pruned set of tokens) extracted from the full scorers. The top-k candidates are extracted based on the beam size and the scorer_beam_scale, such that the number of candidates is int(beam_size * scorer_beam_scale). This can be very useful when the full scorers are computationally expensive (e.g., the KenLM scorer).

Inherit this class to implement your own scorer compatible with speechbrain.decoders.seq2seq.S2SBeamSearcher().

See:
  • speechbrain.decoders.scorer.CTCScorer

  • speechbrain.decoders.scorer.RNNLMScorer

  • speechbrain.decoders.scorer.TransformerLMScorer

  • speechbrain.decoders.scorer.KenLMScorer

  • speechbrain.decoders.scorer.CoverageScorer

  • speechbrain.decoders.scorer.LengthScorer
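
As a minimal illustration of the interface (not an existing library scorer), a sketch of a custom scorer that assigns a neutral score to every token and keeps no state could look as follows; the vocabulary size is a hypothetical placeholder:

>>> import torch
>>> from speechbrain.decoders.scorer import BaseScorerInterface
>>> class UniformScorer(BaseScorerInterface):
...     """Toy scorer: gives every token the same (zero) log-score."""
...     def __init__(self, vocab_size):
...         self.vocab_size = vocab_size
...     def score(self, inp_tokens, memory, candidates, attn):
...         # Same score for every token; memory is returned unchanged.
...         scores = torch.zeros(inp_tokens.size(0), self.vocab_size)
...         return scores, memory
...     def permute_mem(self, memory, index):
...         return memory
...     def reset_mem(self, x, enc_lens):
...         return None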

score(inp_tokens, memory, candidates, attn)[source]

This method scores the new beams based on the information from the current timestep.

A score is a tensor of shape (batch_size x beam_size, vocab_size). It is the log probability of the next token given the current timestep input and the previous scorer states.

It can be used to score the pruned top-k candidates to reduce computation overhead, or the full vocabulary set when candidates is None.

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • memory (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

Returns:

  • torch.Tensor – (batch_size x beam_size, vocab_size), Scores for the next tokens.

  • memory (No limit) – The memory variables input for this timestep.

permute_mem(memory, index)[source]

This method permutes the scorer memory to synchronize the memory index with the current output and perform batched beam search.

Parameters:
  • memory (No limit) – The memory variables input for this timestep.

  • index (torch.Tensor) – (batch_size, beam_size). The index of the previous path.
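
In practice, when the memory is a flat tensor of shape (batch_size x beam_size, hidden_size), the permutation usually amounts to an index_select along that flattened dimension. An illustrative sketch (not the code of any particular scorer):

>>> import torch
>>> batch_size, beam_size, hidden_size = 2, 3, 4
>>> memory = torch.randn(batch_size * beam_size, hidden_size)
>>> index = torch.randint(0, beam_size, (batch_size, beam_size))
>>> # Turn per-batch beam indices into flat indices into the memory tensor.
>>> flat_index = (index + torch.arange(batch_size).unsqueeze(1) * beam_size).view(-1)
>>> memory = torch.index_select(memory, dim=0, index=flat_index)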

reset_mem(x, enc_lens)[source]

This method should implement the resetting of memory variables for the scorer.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.CTCScorer(ctc_fc, blank_index, eos_index, ctc_window_size=0)[source]

Bases: BaseScorerInterface

A wrapper of CTCPrefixScore based on the BaseScorerInterface.

This scorer provides the CTC label-synchronous scores of the next input tokens. The implementation is based on https://www.merl.com/publications/docs/TR2017-190.pdf.

See:
  • speechbrain.decoders.ctc.CTCPrefixScore

Parameters:
  • ctc_fc (torch.nn.Module) – An output linear layer for CTC.

  • blank_index (int) – The index of the blank token.

  • eos_index (int) – The index of the end-of-sequence (eos) token.

  • ctc_window_size (int) – Compute the CTC scores over the time frames using windowing based on attention peaks. If 0, no windowing is applied. (default: 0)

Example

>>> import torch
>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR
>>> from speechbrain.decoders import S2STransformerBeamSearcher, CTCScorer, ScorerBuilder
>>> batch_size=8
>>> n_channels=6
>>> input_size=40
>>> d_model=128
>>> tgt_vocab=140
>>> src = torch.rand([batch_size, n_channels, input_size])
>>> tgt = torch.randint(0, tgt_vocab, [batch_size, n_channels])
>>> net = TransformerASR(
...    tgt_vocab, input_size, d_model, 8, 1, 1, 1024, activation=torch.nn.GELU
... )
>>> ctc_lin = Linear(input_shape=(1, 40, d_model), n_neurons=tgt_vocab)
>>> lin = Linear(input_shape=(1, 40, d_model), n_neurons=tgt_vocab)
>>> eos_index = 2
>>> ctc_scorer = CTCScorer(
...    ctc_fc=ctc_lin,
...    blank_index=0,
...    eos_index=eos_index,
... )
>>> scorer = ScorerBuilder(
...     full_scorers=[ctc_scorer],
...     weights={'ctc': 1.0}
... )
>>> searcher = S2STransformerBeamSearcher(
...     modules=[net, lin],
...     bos_index=1,
...     eos_index=eos_index,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     using_eos_threshold=False,
...     beam_size=7,
...     temperature=1.15,
...     scorer=scorer
... )
>>> enc, dec = net.forward(src, tgt)
>>> hyps, _, _, _ = searcher(enc, torch.ones(batch_size))
score(inp_tokens, memory, candidates, attn)[source]

This method scores the new beams based on the CTC scores computed over the time frames.

See:
  • speechbrain.decoders.ctc.CTCPrefixScore

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • memory (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

permute_mem(memory, index)[source]

This method permutes the scorer memory to synchronize the memory index with the current output and perform batched CTC beam search.

Parameters:
  • memory (No limit) – The memory variables input for this timestep.

  • index (torch.Tensor) – (batch_size, beam_size). The index of the previous path.

reset_mem(x, enc_lens)[source]

This method implements the resetting of memory variables for the CTC scorer.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.RNNLMScorer(language_model, temperature=1.0)[source]

Bases: BaseScorerInterface

A wrapper of RNNLM based on BaseScorerInterface.

The RNNLMScorer is used to provide the RNNLM scores of the next input tokens based on the current timestep input and the previous scorer states.

Parameters:
  • language_model (torch.nn.Module) – A RNN-based language model.

  • temperature (float) – Temperature factor applied to softmax. It changes the probability distribution, being softer when T>1 and sharper with T<1. (default: 1.0)

Example

>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.lobes.models.RNNLM import RNNLM
>>> from speechbrain.nnet.RNN import AttentionalRNNDecoder
>>> from speechbrain.decoders import S2SRNNBeamSearcher, RNNLMScorer, ScorerBuilder
>>> input_size=17
>>> vocab_size=11
>>> emb = torch.nn.Embedding(
...     embedding_dim=input_size,
...     num_embeddings=vocab_size,
... )
>>> d_model=7
>>> dec = AttentionalRNNDecoder(
...     rnn_type="gru",
...     attn_type="content",
...     hidden_size=3,
...     attn_dim=3,
...     num_layers=1,
...     enc_dim=d_model,
...     input_size=input_size,
... )
>>> n_channels=3
>>> seq_lin = Linear(input_shape=[d_model, n_channels], n_neurons=vocab_size)
>>> lm_weight = 0.4
>>> lm_model = RNNLM(
...     embedding_dim=d_model,
...     output_neurons=vocab_size,
...     dropout=0.0,
...     rnn_neurons=128,
...     dnn_neurons=64,
...     return_hidden=True,
... )
>>> rnnlm_scorer = RNNLMScorer(
...     language_model=lm_model,
...     temperature=1.25,
... )
>>> scorer = ScorerBuilder(
...     full_scorers=[rnnlm_scorer],
...     weights={'rnnlm': lm_weight}
... )
>>> beam_size=5
>>> searcher = S2SRNNBeamSearcher(
...     embedding=emb,
...     decoder=dec,
...     linear=seq_lin,
...     bos_index=1,
...     eos_index=2,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     topk=2,
...     using_eos_threshold=False,
...     beam_size=beam_size,
...     temperature=1.25,
...     scorer=scorer
... )
>>> batch_size=2
>>> enc = torch.rand([batch_size, n_channels, d_model])
>>> wav_len = torch.ones([batch_size])
>>> hyps, _, _, _ = searcher(enc, wav_len)
score(inp_tokens, memory, candidates, attn)[source]

This method scores the new beams based on the RNNLM scores computed over the previous tokens.

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • memory (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

permute_mem(memory, index)[source]

This method permutes the scorer memory to synchronize the memory index with the current output and perform batched beam search.

Parameters:
  • memory (No limit) – The memory variables input for this timestep.

  • index (torch.Tensor) – (batch_size, beam_size). The index of the previous path.

reset_mem(x, enc_lens)[source]

This method implements the resetting of memory variables for the RNNLM scorer.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.TransformerLMScorer(language_model, temperature=1.0)[source]

Bases: BaseScorerInterface

A wrapper of TransformerLM based on BaseScorerInterface.

The TransformerLMScorer is used to provide the TransformerLM scores of the next input tokens based on the current timestep input and the previous scorer states.

Parameters:
  • language_model (torch.nn.Module) – A Transformer-based language model.

  • temperature (float) – Temperature factor applied to softmax. It changes the probability distribution, being softer when T>1 and sharper with T<1. (default: 1.0)

Example

>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR
>>> from speechbrain.lobes.models.transformer.TransformerLM import TransformerLM
>>> from speechbrain.decoders import S2STransformerBeamSearcher, TransformerLMScorer, CTCScorer, ScorerBuilder
>>> input_size=17
>>> vocab_size=11
>>> d_model=128
>>> net = TransformerASR(
...     tgt_vocab=vocab_size,
...     input_size=input_size,
...     d_model=d_model,
...     nhead=8,
...     num_encoder_layers=1,
...     num_decoder_layers=1,
...     d_ffn=256,
...     activation=torch.nn.GELU
... )
>>> lm_model = TransformerLM(
...     vocab=vocab_size,
...     d_model=d_model,
...     nhead=8,
...     num_encoder_layers=1,
...     num_decoder_layers=0,
...     d_ffn=256,
...     activation=torch.nn.GELU,
... )
>>> n_channels=6
>>> ctc_lin = Linear(input_size=d_model, n_neurons=vocab_size)
>>> seq_lin = Linear(input_size=d_model, n_neurons=vocab_size)
>>> eos_index = 2
>>> ctc_scorer = CTCScorer(
...     ctc_fc=ctc_lin,
...     blank_index=0,
...     eos_index=eos_index,
... )
>>> transformerlm_scorer = TransformerLMScorer(
...     language_model=lm_model,
...     temperature=1.15,
... )
>>> ctc_weight_decode=0.4
>>> lm_weight=0.6
>>> scorer = ScorerBuilder(
...     full_scorers=[transformerlm_scorer, ctc_scorer],
...     weights={'transformerlm': lm_weight, 'ctc': ctc_weight_decode}
... )
>>> beam_size=5
>>> searcher = S2STransformerBeamSearcher(
...     modules=[net, seq_lin],
...     bos_index=1,
...     eos_index=eos_index,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     using_eos_threshold=False,
...     beam_size=beam_size,
...     temperature=1.15,
...     scorer=scorer
... )
>>> batch_size=2
>>> wav_len = torch.ones([batch_size])
>>> src = torch.rand([batch_size, n_channels, input_size])
>>> tgt = torch.randint(0, vocab_size, [batch_size, n_channels])
>>> enc, dec = net.forward(src, tgt)
>>> hyps, _, _, _ = searcher(enc, wav_len)
score(inp_tokens, memory, candidates, attn)[source]

This method scores the new beams based on the TransformerLM scores computed over the previous tokens.

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • memory (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

permute_mem(memory, index)[source]

This method permutes the scorer memory to synchronize the memory index with the current output and perform batched beam search.

Parameters:
  • memory (No limit) – The memory variables input for this timestep.

  • index (torch.Tensor) – (batch_size, beam_size). The index of the previous path.

reset_mem(x, enc_lens)[source]

This method implements the resetting of memory variables for the TransformerLM scorer.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.KenLMScorer(lm_path, vocab_size, token_list)[source]

Bases: BaseScorerInterface

KenLM N-gram scorer.

This scorer is based on KenLM, which is a fast and efficient N-gram language model toolkit. It is used to provide the n-gram scores of the next input tokens.

This scorer depends on the KenLM package, which can be installed, for example, with pip install kenlm.

Note: The KenLM scorer is computationally expensive. It is recommended to use it as a partial scorer to score on the top-k candidates instead of the full vocabulary set.

Parameters:
  • lm_path (str) – The path of ngram model.

  • vocab_size (int) – The total number of tokens.

  • token_list (list) – The tokens set.

Example

>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.nnet.RNN import AttentionalRNNDecoder
>>> from speechbrain.decoders import S2SRNNBeamSearcher, KenLMScorer, ScorerBuilder
>>> input_size=17
>>> vocab_size=11
>>> lm_path='path/to/kenlm_model.arpa' # or .bin
>>> token_list=['<pad>', '<bos>', '<eos>', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']
>>> emb = torch.nn.Embedding(
...     embedding_dim=input_size,
...     num_embeddings=vocab_size,
... )
>>> d_model=7
>>> dec = AttentionalRNNDecoder(
...     rnn_type="gru",
...     attn_type="content",
...     hidden_size=3,
...     attn_dim=3,
...     num_layers=1,
...     enc_dim=d_model,
...     input_size=input_size,
... )
>>> n_channels=3
>>> seq_lin = Linear(input_shape=[d_model, n_channels], n_neurons=vocab_size)
>>> kenlm_weight = 0.4
>>> kenlm_model = KenLMScorer(
...     lm_path=lm_path,
...     vocab_size=vocab_size,
...     token_list=token_list,
... )
>>> scorer = ScorerBuilder(
...     full_scorers=[kenlm_model],
...     weights={'kenlm': kenlm_weight}
... )
>>> beam_size=5
>>> searcher = S2SRNNBeamSearcher(
...     embedding=emb,
...     decoder=dec,
...     linear=seq_lin,
...     bos_index=1,
...     eos_index=2,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     topk=2,
...     using_eos_threshold=False,
...     beam_size=beam_size,
...     temperature=1.25,
...     scorer=scorer
... )
>>> batch_size=2
>>> enc = torch.rand([batch_size, n_channels, d_model])
>>> wav_len = torch.ones([batch_size])
>>> hyps, _, _, _ = searcher(enc, wav_len)
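
Because KenLM scoring is computationally expensive, it is typically registered as a partial scorer so that it only scores the pruned top-k candidates. A variant of the builder above (the scorer_beam_scale value is only an illustration):

>>> scorer = ScorerBuilder(
...     partial_scorers=[kenlm_model],   # KenLM only scores the pruned candidates
...     weights={'kenlm': kenlm_weight},
...     scorer_beam_scale=1.5,           # number of candidates = int(beam_size * 1.5)
... )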

score(inp_tokens, memory, candidates, attn)[source]

This method scores the new beams based on the n-gram scores.

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • memory (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

permute_mem(memory, index)[source]

This method permutes the scorer memory to synchronize the memory index with the current output and perform batched beam search.

Parameters:
  • memory (No limit) – The memory variables input for this timestep.

  • index (torch.Tensor) – (batch_size, beam_size). The index of the previous path.

reset_mem(x, enc_lens)[source]

This method implements the resetting of memory variables for the KenLM scorer.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.CoverageScorer(vocab_size, threshold=0.5)[source]

Bases: BaseScorerInterface

A coverage penalty scorer to prevent looping of hyps, where `coverage` is the cumulative attention probability vector. Reference: https://arxiv.org/pdf/1612.02695.pdf, https://arxiv.org/pdf/1808.10792.pdf.

Parameters:
  • vocab_size (int) – The total number of tokens.

  • threshold (float) – The penalty increases when the coverage of a frame exceeds the given threshold. (default: 0.5)
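
For intuition only (a simplified illustration, not the exact penalty computed by the library): the penalty grows with the attention mass that has accumulated beyond the threshold on each encoder frame, which discourages hypotheses that keep attending to the same frames.

>>> import torch
>>> threshold = 0.5
>>> coverage = torch.tensor([[0.2, 0.9, 1.4]])  # cumulative attention per frame
>>> penalty = (coverage - threshold).clamp(min=0.0).sum(dim=-1)
>>> penalty
tensor([1.3000])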

Example

>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.lobes.models.RNNLM import RNNLM
>>> from speechbrain.nnet.RNN import AttentionalRNNDecoder
>>> from speechbrain.decoders import S2SRNNBeamSearcher, RNNLMScorer, CoverageScorer, ScorerBuilder
>>> input_size=17
>>> vocab_size=11
>>> emb = torch.nn.Embedding(
...     num_embeddings=vocab_size,
...     embedding_dim=input_size
... )
>>> d_model=7
>>> dec = AttentionalRNNDecoder(
...     rnn_type="gru",
...     attn_type="content",
...     hidden_size=3,
...     attn_dim=3,
...     num_layers=1,
...     enc_dim=d_model,
...     input_size=input_size,
... )
>>> n_channels=3
>>> seq_lin = Linear(input_shape=[d_model, n_channels], n_neurons=vocab_size)
>>> lm_weight = 0.4
>>> coverage_penalty = 1.0
>>> lm_model = RNNLM(
...     embedding_dim=d_model,
...     output_neurons=vocab_size,
...     dropout=0.0,
...     rnn_neurons=128,
...     dnn_neurons=64,
...     return_hidden=True,
... )
>>> rnnlm_scorer = RNNLMScorer(
...     language_model=lm_model,
...     temperature=1.25,
... )
>>> coverage_scorer = CoverageScorer(vocab_size=vocab_size)
>>> scorer = ScorerBuilder(
...     full_scorers=[rnnlm_scorer, coverage_scorer],
...     weights={'rnnlm': lm_weight, 'coverage': coverage_penalty}
... )
>>> beam_size=5
>>> searcher = S2SRNNBeamSearcher(
...     embedding=emb,
...     decoder=dec,
...     linear=seq_lin,
...     bos_index=1,
...     eos_index=2,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     topk=2,
...     using_eos_threshold=False,
...     beam_size=beam_size,
...     temperature=1.25,
...     scorer=scorer
... )
>>> batch_size=2
>>> enc = torch.rand([batch_size, n_channels, d_model])
>>> wav_len = torch.ones([batch_size])
>>> hyps, _, _, _ = searcher(enc, wav_len)
score(inp_tokens, coverage, candidates, attn)[source]

This method scores the new beams based on the coverage penalty.

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • coverage (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

permute_mem(coverage, index)[source]

This method permutes the scorer memory to synchronize the memory index with the current output and perform batched beam search.

Parameters:
  • coverage (No limit) – The memory variables input for this timestep.

  • index (torch.Tensor) – (batch_size, beam_size). The index of the previous path.

reset_mem(x, enc_lens)[source]

This method implements the resetting of memory variables for the coverage scorer.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.LengthScorer(vocab_size)[source]

Bases: BaseScorerInterface

A length rewarding scorer.

The LengthScorer is used to provide the length rewarding scores. It is used to prevent the beam search from favoring short hypotheses.

Note: length_normalization is not compatible with this scorer. Make sure to set it to False when using LengthScorer.

Parameters:

vocab_size (int) – The total number of tokens.
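
For intuition (an illustrative simplification, not necessarily the library's exact implementation): a length reward adds the same bonus for every candidate token at each decoding step, so a hypothesis of length L accumulates a total reward proportional to L, offsetting the natural bias of summed log-probabilities toward shorter hypotheses.

>>> import torch
>>> vocab_size = 11
>>> # One decoding step: the same +1 bonus for every candidate token; the
>>> # 'length' weight passed to ScorerBuilder scales this bonus.
>>> step_reward = torch.ones(1, vocab_size)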

Example

>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.lobes.models.RNNLM import RNNLM
>>> from speechbrain.nnet.RNN import AttentionalRNNDecoder
>>> from speechbrain.decoders import S2SRNNBeamSearcher, RNNLMScorer, CoverageScorer, ScorerBuilder
>>> input_size=17
>>> vocab_size=11
>>> emb = torch.nn.Embedding(
...     num_embeddings=vocab_size,
...     embedding_dim=input_size
... )
>>> d_model=7
>>> dec = AttentionalRNNDecoder(
...     rnn_type="gru",
...     attn_type="content",
...     hidden_size=3,
...     attn_dim=3,
...     num_layers=1,
...     enc_dim=d_model,
...     input_size=input_size,
... )
>>> n_channels=3
>>> seq_lin = Linear(input_shape=[d_model, n_channels], n_neurons=vocab_size)
>>> lm_weight = 0.4
>>> length_weight = 1.0
>>> lm_model = RNNLM(
...     embedding_dim=d_model,
...     output_neurons=vocab_size,
...     dropout=0.0,
...     rnn_neurons=128,
...     dnn_neurons=64,
...     return_hidden=True,
... )
>>> rnnlm_scorer = RNNLMScorer(
...     language_model=lm_model,
...     temperature=1.25,
... )
>>> length_scorer = LengthScorer(vocab_size=vocab_size)
>>> scorer = ScorerBuilder(
...     full_scorers=[rnnlm_scorer, length_scorer],
...     weights={'rnnlm': lm_weight, 'length': length_weight}
... )
>>> beam_size=5
>>> searcher = S2SRNNBeamSearcher(
...     embedding=emb,
...     decoder=dec,
...     linear=seq_lin,
...     bos_index=1,
...     eos_index=2,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     topk=2,
...     using_eos_threshold=False,
...     beam_size=beam_size,
...     temperature=1.25,
...     length_normalization=False,
...     scorer=scorer
... )
>>> batch_size=2
>>> enc = torch.rand([batch_size, n_channels, d_model])
>>> wav_len = torch.ones([batch_size])
>>> hyps, _, _, _ = searcher(enc, wav_len)
score(inp_tokens, memory, candidates, attn)[source]

This method scores the new beams based on the length reward.

Parameters:
  • inp_tokens (torch.Tensor) – The input tensor of the current timestep.

  • memory (No limit) – The scorer states for this timestep.

  • candidates (torch.Tensor) – (batch_size x beam_size, scorer_beam_size). The top-k candidates to be scored after the full scorers. If None, scorers will score on full vocabulary set.

  • attn (torch.Tensor) – The attention weight to be used in CoverageScorer or CTCScorer.

class speechbrain.decoders.scorer.ScorerBuilder(weights={}, full_scorers=[], partial_scorers=[], scorer_beam_scale=2)[source]

Bases: object

Builds a scorer instance for beam search.

The ScorerBuilder class is responsible for building a scorer instance for beam search. It takes weights for full and partial scorers, as well as instances of full and partial scorer classes. It combines the scorers based on the weights specified and provides methods for scoring tokens, permuting scorer memory, and resetting scorer memory.

This is the class to be used for building scorer instances for beam search.

See speechbrain.decoders.seq2seq.S2SBeamSearcher()

Parameters:
  • weights (dict) – Weights of full/partial scorers specified.

  • full_scorers (list) – Scorers that score on full vocabulary set.

  • partial_scorers (list) – Scorers that score on pruned tokens to prevent computation overhead. Partial scoring is performed after full scorers.

  • scorer_beam_scale (float) – The scale decides the number of pruned tokens for partial scorers: int(beam_size * scorer_beam_scale).

Example

>>> from speechbrain.nnet.linear import Linear
>>> from speechbrain.lobes.models.transformer.TransformerASR import TransformerASR
>>> from speechbrain.lobes.models.transformer.TransformerLM import TransformerLM
>>> from speechbrain.decoders import S2STransformerBeamSearcher, TransformerLMScorer, CoverageScorer, CTCScorer, ScorerBuilder
>>> input_size=17
>>> vocab_size=11
>>> d_model=128
>>> net = TransformerASR(
...     tgt_vocab=vocab_size,
...     input_size=input_size,
...     d_model=d_model,
...     nhead=8,
...     num_encoder_layers=1,
...     num_decoder_layers=1,
...     d_ffn=256,
...     activation=torch.nn.GELU
... )
>>> lm_model = TransformerLM(
...     vocab=vocab_size,
...     d_model=d_model,
...     nhead=8,
...     num_encoder_layers=1,
...     num_decoder_layers=0,
...     d_ffn=256,
...     activation=torch.nn.GELU,
... )
>>> n_channels=6
>>> ctc_lin = Linear(input_size=d_model, n_neurons=vocab_size)
>>> seq_lin = Linear(input_size=d_model, n_neurons=vocab_size)
>>> eos_index = 2
>>> ctc_scorer = CTCScorer(
...     ctc_fc=ctc_lin,
...     blank_index=0,
...     eos_index=eos_index,
... )
>>> transformerlm_scorer = TransformerLMScorer(
...     language_model=lm_model,
...     temperature=1.15,
... )
>>> coverage_scorer = CoverageScorer(vocab_size=vocab_size)
>>> ctc_weight_decode=0.4
>>> lm_weight=0.6
>>> coverage_penalty = 1.0
>>> scorer = ScorerBuilder(
...     full_scorers=[transformerlm_scorer, coverage_scorer],
...     partial_scorers=[ctc_scorer],
...     weights={'transformerlm': lm_weight, 'ctc': ctc_weight_decode, 'coverage': coverage_penalty}
... )
>>> beam_size=5
>>> searcher = S2STransformerBeamSearcher(
...     modules=[net, seq_lin],
...     bos_index=1,
...     eos_index=eos_index,
...     min_decode_ratio=0.0,
...     max_decode_ratio=1.0,
...     using_eos_threshold=False,
...     beam_size=beam_size,
...     topk=3,
...     temperature=1.15,
...     scorer=scorer
... )
>>> batch_size=2
>>> wav_len = torch.ones([batch_size])
>>> src = torch.rand([batch_size, n_channels, input_size])
>>> tgt = torch.randint(0, vocab_size, [batch_size, n_channels])
>>> enc, dec = net.forward(src, tgt)
>>> hyps, _, _, _  = searcher(enc, wav_len)
score(inp_tokens, memory, attn, log_probs, beam_size)[source]

This method scores tokens in the vocabulary based on the defined full and partial scorers. The scores are added to the log probs for beam search.

Parameters:
  • inp_tokens (torch.Tensor) – See BaseScorerInterface().

  • memory (dict[str, scorer memory]) – The states of scorers for this timestep.

  • attn (torch.Tensor) – See BaseScorerInterface().

  • log_probs (torch.Tensor) – (batch_size x beam_size, vocab_size). The log probs at this timestep.

  • beam_size (int) – The beam size.

Returns:

  • log_probs (torch.Tensor) – (batch_size x beam_size, vocab_size). Log probs updated by scorers.

  • new_memory (dict[str, scorer memory]) – The updated states of scorers.

permute_scorer_mem(memory, index, candidates)[source]

Update memory variables of scorers to synchronize the memory index with the current output and perform batched beam search.

Parameters:
  • memory (dict[str, scorer memory]) – The states of scorers for this timestep.

  • index (torch.Tensor) – (batch_size x beam_size). The index of the previous path.

  • candidates (torch.Tensor) – (batch_size, beam_size). The index of the topk candidates.

reset_scorer_mem(x, enc_lens)[source]

Reset memory variables for scorers.

Parameters:
  • x (torch.Tensor) – The precomputed encoder states to be used when decoding. (ex. the encoded speech representation to be attended).

  • enc_lens (torch.Tensor) – The speechbrain-style relative length.

class speechbrain.decoders.scorer.BaseRescorerInterface[source]

Bases: BaseScorerInterface

A rescorer abstraction to be inherited by other rescoring approaches that re-rank the hypotheses produced by beam search.

In this approach, a neural network is employed to assign scores to potential text transcripts. The beam search decoding process produces a collection of the top K hypotheses. These candidates are then passed to a language model (LM), which re-ranks them by assigning a score to each candidate.

The score is computed as follows:

score = beam_search_score + lm_weight * rescorer_score
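
A tiny numerical illustration of this combination (the scores and lm_weight below are hypothetical):

>>> beam_search_scores = [-2.0, -2.1, -2.3]   # from beam search (top-3 hypotheses)
>>> rescorer_scores = [-18.0, -26.1, -25.2]   # log-probs assigned by the rescoring LM
>>> lm_weight = 0.5
>>> combined = [b + lm_weight * r for b, r in zip(beam_search_scores, rescorer_scores)]
>>> # Hypotheses are then re-ranked by the combined score (closer to 0 is better).
>>> sorted(range(3), key=lambda i: combined[i], reverse=True)
[0, 2, 1]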

See:
  • speechbrain.decoders.scorer.RNNLMRescorer

  • speechbrain.decoders.scorer.TransformerLMRescorer

  • speechbrain.decoders.scorer.HuggingFaceLMRescorer

normalize_text(text)[source]

This method should implement the normalization of the text before scoring.

Parameters:

text (list of str) – The text to be normalized.

preprocess_func(hyps)[source]

This method should implement the preprocessing of the hypotheses before scoring.

Parameters:

hyps (list of str) – The hypotheses to be preprocessed.

rescore_hyps(hyps)[source]

This method should implement the rescoring of the hypotheses.

Parameters:

hyps (list of str) – The hypotheses to be rescored.

to_device(device=None)[source]

This method should implement the moving of the scorer to a device.

If device is None, the scorer should be moved to the default device provided in the constructor.

Parameters:

device (str) – The device to move the scorer to.

class speechbrain.decoders.scorer.RNNLMRescorer(language_model, tokenizer, device='cuda', temperature=1.0, bos_index=0, eos_index=0, pad_index=0)[source]

Bases: BaseRescorerInterface

A wrapper of RNNLM based on the BaseRescorerInterface.

Parameters:
  • language_model (torch.nn.Module) – A RNN-based language model.

  • tokenizer (SentencePieceProcessor) – A SentencePiece tokenizer.

  • device (str) – The device to move the scorer to.

  • temperature (float) – Temperature factor applied to softmax. It changes the probability distribution, being softer when T>1 and sharper with T<1. (default: 1.0)

  • bos_index (int) – The index of the beginning-of-sequence (bos) token.

  • eos_index (int) – The index of the end-of-sequence (eos) token.

  • pad_index (int) – The index of the padding token.

Note

This class is intended to be used with a pretrained RNNLM model. Please see: https://huggingface.co/speechbrain/asr-crdnn-rnnlm-librispeech

By default, this model uses a SentencePiece tokenizer.

Example

>>> import torch
>>> from sentencepiece import SentencePieceProcessor
>>> from speechbrain.lobes.models.RNNLM import RNNLM
>>> from speechbrain.utils.parameter_transfer import Pretrainer
>>> source = "speechbrain/asr-crdnn-rnnlm-librispeech"
>>> lm_model_path = source + "/lm.ckpt"
>>> tokenizer_path = source + "/tokenizer.ckpt"
>>> # define your tokenizer and RNNLM from the HF hub
>>> tokenizer = SentencePieceProcessor()
>>> lm_model = RNNLM(
...    output_neurons = 1000,
...    embedding_dim = 128,
...    activation = torch.nn.LeakyReLU,
...    dropout = 0.0,
...    rnn_layers = 2,
...    rnn_neurons = 2048,
...    dnn_blocks = 1,
...    dnn_neurons = 512,
...    return_hidden = True,
... )
>>> pretrainer = Pretrainer(
...     collect_in = getfixture("tmp_path"),
...    loadables = {
...     "lm" : lm_model,
...     "tokenizer" : tokenizer,
...     },
...    paths = {
...     "lm" : lm_model_path,
...     "tokenizer" : tokenizer_path,
... })
>>> _ = pretrainer.collect_files()
>>> pretrainer.load_collected()
>>> from speechbrain.decoders.scorer import RNNLMRescorer, RescorerBuilder
>>> rnnlm_rescorer = RNNLMRescorer(
...    language_model = lm_model,
...    tokenizer = tokenizer,
...    temperature = 1.0,
...    bos_index = 0,
...    eos_index = 0,
...    pad_index = 0,
... )
>>> # Define a rescorer builder
>>> rescorer = RescorerBuilder(
...    rescorers=[rnnlm_rescorer],
...    weights={"rnnlm":1.0}
... )
>>> # topk hyps
>>> topk_hyps = [["HELLO", "HE LLO", "H E L L O"]]
>>> topk_scores = [[-2, -2, -2]]
>>> rescored_hyps, rescored_scores = rescorer.rescore(topk_hyps, topk_scores)
>>> # NOTE: the returned hypotheses are already sorted by score.
>>> rescored_hyps 
[['HELLO', 'H E L L O', 'HE LLO']]
>>> # NOTE: as we are returning log-probs, the closer the score is to 0, the better.
>>> rescored_scores 
[[-17.863974571228027, -25.12890625, -26.075977325439453]]
normalize_text(text)[source]

This method should implement the normalization of the text before scoring.

Defaults to uppercasing the text because the (current) language models are trained on LibriSpeech, which is all uppercase.

Parameters:

text (str) – The text to be normalized.

Returns:

The normalized text.

Return type:

str

to_device(device=None)[source]

This method moves the scorer to a device.

If device is None, the scorer is moved to the default device provided in the constructor.

Parameters:

device (str) – The device to move the scorer to.

preprocess_func(topk_hyps)[source]

This method preprocesses the hypotheses before scoring.

Parameters:

topk_hyps (list of list of str) – The hypotheses to be preprocessed.

Returns:

  • padded_hyps (torch.Tensor) – The padded hypotheses.

  • enc_hyps_length (list of int) – The length of each hypothesis.

rescore_hyps(topk_hyps)[source]

This method implements the rescoring of the hypotheses.

Parameters:

topk_hyps (list of list of str) – The hypotheses to be rescored.

Returns:

log_probs_scores – The rescored hypotheses scores

Return type:

torch.Tensor[B * Topk, 1]

class speechbrain.decoders.scorer.TransformerLMRescorer(language_model, tokenizer, device='cuda', temperature=1.0, bos_index=0, eos_index=0, pad_index=0)[source]

Bases: BaseRescorerInterface

A wrapper of TransformerLM based on the BaseRescorerInterface.

Parameters:
  • language_model (torch.nn.Module) – A Transformer-based language model.

  • tokenizer (SentencePieceProcessor) – A SentencePiece tokenizer.

  • device (str) – The device to move the scorer to.

  • temperature (float) – Temperature factor applied to softmax. It changes the probability distribution, being softer when T>1 and sharper with T<1. (default: 1.0)

  • bos_index (int) – The index of the beginning-of-sequence (bos) token.

  • eos_index (int) – The index of the end-of-sequence (eos) token.

  • pad_index (int) – The index of the padding token.

Note

This class is intended to be used with a pretrained TransformerLM model. Please see: https://huggingface.co/speechbrain/asr-transformer-transformerlm-librispeech

By default, this model uses a SentencePiece tokenizer.

Example

>>> import torch
>>> from sentencepiece import SentencePieceProcessor
>>> from speechbrain.lobes.models.transformer.TransformerLM import TransformerLM
>>> from speechbrain.utils.parameter_transfer import Pretrainer
>>> source = "speechbrain/asr-transformer-transformerlm-librispeech"
>>> lm_model_path = source + "/lm.ckpt"
>>> tokenizer_path = source + "/tokenizer.ckpt"
>>> tokenizer = SentencePieceProcessor()
>>> lm_model = TransformerLM(
...     vocab=5000,
...     d_model=768,
...     nhead=12,
...     num_encoder_layers=12,
...     num_decoder_layers=0,
...     d_ffn=3072,
...     dropout=0.0,
...     activation=torch.nn.GELU,
...     normalize_before=False,
... )
>>> pretrainer = Pretrainer(
...     collect_in = getfixture("tmp_path"),
...     loadables={
...         "lm": lm_model,
...         "tokenizer": tokenizer,
...     },
...     paths={
...         "lm": lm_model_path,
...         "tokenizer": tokenizer_path,
...     }
... )
>>> _ = pretrainer.collect_files()
>>> pretrainer.load_collected()
>>> from speechbrain.decoders.scorer import TransformerLMRescorer, RescorerBuilder
>>> transformerlm_rescorer = TransformerLMRescorer(
...     language_model=lm_model,
...     tokenizer=tokenizer,
...     temperature=1.0,
...     bos_index=1,
...     eos_index=2,
...     pad_index=0,
... )
>>> rescorer = RescorerBuilder(
...     rescorers=[transformerlm_rescorer],
...     weights={"transformerlm": 1.0}
... )
>>> topk_hyps = [["HELLO", "HE LLO", "H E L L O"]]
>>> topk_scores = [[-2, -2, -2]]
>>> rescored_hyps, rescored_scores = rescorer.rescore(topk_hyps, topk_scores)
>>> # NOTE: the returned hypotheses are already sorted by score.
>>> rescored_hyps 
[["HELLO", "HE L L O", "HE LLO"]]
>>> # NOTE: as we are returning log-probs, the closer the score is to 0, the better.
>>> rescored_scores  
[[-17.863974571228027, -25.12890625, -26.075977325439453]]
normalize_text(text)[source]

This method should implement the normalization of the text before scoring.

Defaults to uppercasing the text because the language models are trained on LibriSpeech.

Parameters:

text (str) – The text to be normalized.

Returns:

The normalized text.

Return type:

str

to_device(device=None)[source]

This method moves the scorer to a device.

If device is None, the scorer is moved to the default device provided in the constructor.

This method is dynamically called in the recipes when the stage is equal to TEST.

Parameters:

device (str) – The device to move the scorer to.

preprocess_func(topk_hyps)[source]

This method preprocesses the hypotheses before scoring.

Parameters:

topk_hyps (list of list of str) – The hypotheses to be preprocessed.

Returns:

  • padded_hyps (torch.Tensor) – The padded hypotheses.

  • enc_hyps_length (list of int) – The length of each hypothesis.

rescore_hyps(topk_hyps)[source]

This method implements the rescoring of the hypotheses.

Parameters:

topk_hyps (list of list of str) – The hypotheses to be rescored.

Returns:

log_probs_scores – The rescored hypotheses scores

Return type:

torch.Tensor[B * Topk, 1]

class speechbrain.decoders.scorer.HuggingFaceLMRescorer(model_name, device='cuda')[source]

Bases: BaseRescorerInterface

A wrapper of HuggingFace’s TransformerLM based on the BaseRescorerInterface.

Parameters:
  • model_name (str) – The name of the model to be loaded.

  • device (str) – The device to be used for scoring. (default: “cuda”)

Example

>>> from speechbrain.decoders.scorer import HuggingFaceLMRescorer, RescorerBuilder
>>> source = "gpt2-medium"
>>> huggingfacelm_rescorer = HuggingFaceLMRescorer(
...     model_name=source,
... )
>>> rescorer = RescorerBuilder(
...     rescorers=[huggingfacelm_rescorer],
...     weights={"huggingfacelm": 1.0}
... )
>>> topk_hyps = [["Hello everyone.", "Hell o every one.", "Hello every one"]]
>>> topk_scores = [[-2, -2, -2]]
>>> rescored_hyps, rescored_scores = rescorer.rescore(topk_hyps, topk_scores)
>>> # NOTE: the returned hypotheses are already sorted by score.
>>> rescored_hyps 
[['Hello everyone.', 'Hello every one', 'Hell o every one.']]
>>> # NOTE: as we are returning log-probs, the closer the score is to 0, the better.
>>> rescored_scores 
[[-20.03631591796875, -27.615638732910156, -42.662353515625]]
to_device(device=None)[source]

This method moves the scorer to a device.

If device is None, the scorer is moved to the default device provided in the constructor.

This method is dynamically called in the recipes when the stage is equal to TEST.

Parameters:

device (str) – The device to move the scorer to.

normalize_text(text)[source]

This method should implement the normalization of the text before scoring.

Parameters:

text (str) – The text to be normalized.

Returns:

normalized_text – The normalized text. In this case we do not apply any normalization. However, this method can be overridden to apply any normalization.

Return type:

str

preprocess_func(topk_hyps)[source]

This method preprocesses the hypotheses before scoring.

Parameters:

topk_hyps (list of str) – The hypotheses to be preprocessed.

Returns:

encoding – The encoding of the hypotheses.

Return type:

tensor

rescore_hyps(topk_hyps)[source]

This method implements the rescoring of the hypotheses.

Parameters:

topk_hyps (list of list of str) – The hypotheses to be rescored.

Returns:

log_probs_scores – The rescored hypotheses scores

Return type:

torch.Tensor[B * Topk, 1]

class speechbrain.decoders.scorer.RescorerBuilder(weights={}, rescorers=[])[source]

Bases: object

Builds a rescorer instance for beam search.

The RescorerBuilder class is responsible for building a rescorer instance for beam search. It takes weights and rescorer classes. It combines the rescorers based on the weights specified and provides methods for rescoring text.

This is the class to be used for building rescorer instances for beam search.

Parameters:
  • weights (dict) – Weights of rescorers specified.

  • rescorers (list) – Rescorers that re-ranks topk hypotheses.

rescore(topk_candidates, topk_scores)[source]

This method rescores the topk candidates.

Parameters:
  • topk_candidates (list of list of str) – The topk candidates to be rescored.

  • topk_scores (list of list of float) – The scores of the topk candidates.

Returns:

  • output_candidates (list of list of str) – The rescored candidates.

  • output_scores (list of list of float) – The rescored scores.

move_rescorers_to_device(device=None)[source]

Moves rescorers to device.

Useful to avoid keeping rescorers on the GPU during the TRAIN and VALID stages.

Parameters:

device (str) – The device to be used for scoring. (default: None)