speechbrain.inference.ASR module

Specifies the inference interfaces for Automatic Speech Recognition (ASR) modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023, 2024

  • Adel Moumen 2023, 2024

  • Pradnya Kandarkar 2023

Summary

Classes:

ASRStreamingContext

Streaming metadata, initialized by make_streaming_context() (see there for details on initialization of fields here).

EncoderASR

A ready-to-use Encoder ASR model

EncoderDecoderASR

A ready-to-use Encoder-Decoder ASR model

StreamingASR

A ready-to-use, streaming-capable ASR model.

WhisperASR

A ready-to-use Whisper ASR model

Reference

class speechbrain.inference.ASR.EncoderDecoderASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use Encoder-Decoder ASR model

The class can be used either to run only the encoder (encode()) to extract features or to run the entire encoder-decoder model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Example

>>> from speechbrain.inference.ASR import EncoderDecoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderDecoderASR.from_hparams(
...     source="speechbrain/asr-crdnn-rnnlm-librispeech",
...     savedir=tmpdir,
... )  
>>> asr_model.transcribe_file("tests/samples/single-mic/example2.flac")  
"MY FATHER HAS REVEALED THE CULPRIT'S NAME"
HPARAMS_NEEDED = ['tokenizer']
MODULES_NEEDED = ['encoder', 'decoder']
transcribe_file(path, **kwargs)[source]

Transcribes the given audio file into a sequence of words.

Parameters:

path (str) – Path to the audio file to transcribe.

Returns:

The audio file transcription produced by this ASR system.

Return type:

str

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderDecoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.Tensor
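
For illustration, a minimal sketch of encoding a single utterance with this class, assuming the inherited load_audio() helper to fetch and resample the file (the path matches the doctest above; the savedir is a placeholder):

import torch
from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_asr"
)
signal = asr_model.load_audio("tests/samples/single-mic/example2.flac")
wavs = signal.unsqueeze(0)    # [batch=1, time]
wav_lens = torch.ones(1)      # single waveform, so its relative length is 1.0
encoded = asr_model.encode_batch(wavs, wav_lens)  # model-dependent [batch, time', features]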

transcribe_batch(wavs, wav_lens)[source]

Transcribes the input audio into a sequence of words

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderDecoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – Each waveform in the batch transcribed.

  • tensor – Each predicted token id.
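
A hedged sketch of batching two utterances of different lengths (the file paths are placeholders): the shorter waveform is zero-padded and its relative length is passed through wav_lens:

import torch
from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech", savedir="pretrained_asr"
)
sig1 = asr_model.load_audio("utt1.flac")  # placeholder paths
sig2 = asr_model.load_audio("utt2.flac")
max_len = max(sig1.shape[0], sig2.shape[0])
wavs = torch.zeros(2, max_len)
wavs[0, : sig1.shape[0]] = sig1
wavs[1, : sig2.shape[0]] = sig2
wav_lens = torch.tensor([sig1.shape[0], sig2.shape[0]]) / max_len
words, tokens = asr_model.transcribe_batch(wavs, wav_lens)
print(words)  # one transcription string per waveform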

forward(wavs, wav_lens)[source]

Runs the full transcription. Note: no gradients are propagated through decoding.

training: bool
class speechbrain.inference.ASR.EncoderASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use Encoder ASR model

The class can be used either to run only the encoder (encode()) to extract features or to run the encoder together with the decoding function (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Example

>>> from speechbrain.inference.ASR import EncoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderASR.from_hparams(
...     source="speechbrain/asr-wav2vec2-commonvoice-fr",
...     savedir=tmpdir,
... ) 
>>> asr_model.transcribe_file("samples/audio_samples/example_fr.wav") 
HPARAMS_NEEDED = ['tokenizer', 'decoding_function']
MODULES_NEEDED = ['encoder']
set_decoding_function()[source]

Set the decoding function based on the parameters defined in the hyperparameter file.

The decoding function is determined by the decoding_function specified in the hyperparameter file. It can be either a functools.partial object representing a decoding function or an instance of speechbrain.decoders.ctc.CTCBaseSearcher for beam search decoding.

Raises:

ValueError – If the decoding function is neither a functools.partial nor an instance of speechbrain.decoders.ctc.CTCBaseSearcher.

Note:
  • For greedy decoding (functools.partial), the provided decoding_function is assigned directly.

  • For CTCBeamSearcher decoding, an instance of the specified decoding_function is created, and additional parameters are added based on the tokenizer type.
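
A rough Python sketch of the two accepted forms (in practice the decoding_function is declared in the hparams YAML; the parameter values below are illustrative, not verified defaults):

import functools
from speechbrain.decoders.ctc import ctc_greedy_decode, CTCBeamSearcher

# Greedy decoding: a functools.partial assigned directly as the decoding function.
greedy_fn = functools.partial(ctc_greedy_decode, blank_id=0)

# Beam search: an instance of a CTCBaseSearcher subclass such as CTCBeamSearcher.
# EncoderASR fills in tokenizer-dependent parameters (e.g. the vocabulary) itself,
# so the constructor arguments below are purely illustrative.
# beam_searcher = CTCBeamSearcher(blank_index=0, vocab_list=["a", "b"])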

transcribe_file(path, **kwargs)[source]

Transcribes the given audio file into a sequence of words.

Parameters:

path (str) – Path to the audio file to transcribe.

Returns:

The audio file transcription produced by this ASR system.

Return type:

str

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.Tensor

transcribe_batch(wavs, wav_lens)[source]

Transcribes the input audio into a sequence of words

The waveforms should already be in the model’s desired format. You can call: normalized = EncoderASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – Each waveform in the batch transcribed.

  • tensor – Each predicted token id.
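
For example, a minimal sketch of transcribing one pre-loaded waveform with this class (the audio path and savedir are placeholders):

import torch
from speechbrain.inference.ASR import EncoderASR

asr_model = EncoderASR.from_hparams(
    source="speechbrain/asr-wav2vec2-commonvoice-fr", savedir="pretrained_asr"
)
signal = asr_model.load_audio("example_fr.wav")  # placeholder path
words, tokens = asr_model.transcribe_batch(signal.unsqueeze(0), torch.ones(1))
print(words[0])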

forward(wavs, wav_lens)[source]

Runs the encoder

training: bool
class speechbrain.inference.ASR.WhisperASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use Whisper ASR model

The class can be used to run the entire encoder-decoder Whisper model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Example

>>> from speechbrain.inference.ASR import WhisperASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-medium-commonvoice-it", savedir=tmpdir,) 
>>> asr_model.transcribe_file("speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav")  
HPARAMS_NEEDED = ['language']
MODULES_NEEDED = ['whisper', 'decoder']
transcribe_file(path)[source]

Transcribes the given audio file into a sequence of words.

Parameters:

path (str) – Path to the audio file to transcribe.

Returns:

The audio file transcription produced by this ASR system.

Return type:

str

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. You can call: normalized = WhisperASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.Tensor

transcribe_batch(wavs, wav_lens)[source]

Transcribes the input audio into a sequence of words

The waveforms should already be in the model’s desired format. You can call: normalized = WhisperASR.normalizer(signal, sample_rate) to get a correctly converted signal in most cases.

Parameters:
  • wavs (torch.Tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – Each waveform in the batch transcribed.

  • tensor – Each predicted token id.
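
The same pattern applies here; a hedged sketch with a single waveform (the audio path and savedir are placeholders):

import torch
from speechbrain.inference.ASR import WhisperASR

asr_model = WhisperASR.from_hparams(
    source="speechbrain/asr-whisper-medium-commonvoice-it", savedir="pretrained_asr"
)
signal = asr_model.load_audio("example-it.wav")  # placeholder path
words, tokens = asr_model.transcribe_batch(signal.unsqueeze(0), torch.ones(1))
print(words[0])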

forward(wavs, wav_lens)[source]

Runs the full transcription. Note: no gradients are propagated through decoding.

training: bool
class speechbrain.inference.ASR.ASRStreamingContext(config: DynChunkTrainConfig, fea_extractor_context: Any, encoder_context: Any, decoder_context: Any, tokenizer_context: List[Any] | None)[source]

Bases: object

Streaming metadata, initialized by make_streaming_context() (see there for details on initialization of fields here).

This object is intended to be mutated: the same object should be passed across calls as streaming progresses (namely when using the lower-level encode_chunk(), etc. APIs).

Holds some references to opaque streaming contexts, so the context is model-agnostic to an extent.

config: DynChunkTrainConfig

Dynamic chunk training configuration used to initialize the streaming context. Cannot be modified on the fly.

fea_extractor_context: Any

Opaque feature extractor streaming context.

encoder_context: Any

Opaque encoder streaming context.

decoder_context: Any

Opaque decoder streaming context.

tokenizer_context: List[Any] | None

Opaque streaming context for the tokenizer. Initially None. Initialized to a list of tokenizer contexts once batch size can be determined.

class speechbrain.inference.ASR.StreamingASR(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use, streaming-capable ASR model.

Example

>>> from speechbrain.inference.ASR import StreamingASR
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = StreamingASR.from_hparams(source="speechbrain/asr-conformer-streaming-librispeech", savedir=tmpdir,) 
>>> asr_model.transcribe_file("speechbrain/asr-conformer-streaming-librispeech/test-en.wav", DynChunkTrainConfig(24, 8)) 
HPARAMS_NEEDED = ['fea_streaming_extractor', 'make_decoder_streaming_context', 'decoding_function', 'make_tokenizer_streaming_context', 'tokenizer_decode_streaming']
MODULES_NEEDED = ['enc', 'proj_enc']
transcribe_file_streaming(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True, **kwargs)[source]

Transcribes the given audio file into a sequence of words, in a streaming fashion, meaning that text is yielded from this generator as strings to concatenate.

Parameters:
  • path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.

  • dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.

  • use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).

Returns:

An iterator yielding transcribed chunks (strings). There is a yield for every chunk, even if the transcribed string for that chunk is an empty string.

Return type:

generator of str
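
A sketch of consuming this generator, assuming a local audio file and the illustrative chunk configuration from the example above (sane values are model-dependent):

from speechbrain.inference.ASR import StreamingASR
from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig

asr_model = StreamingASR.from_hparams(
    source="speechbrain/asr-conformer-streaming-librispeech", savedir="pretrained_asr"
)
cfg = DynChunkTrainConfig(24, 8)  # chunk size and left context, as in the doctest above
for text_chunk in asr_model.transcribe_file_streaming("test-en.wav", cfg):
    print(text_chunk, end="", flush=True)  # some chunks may be empty strings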

training: bool
transcribe_file(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True)[source]

Transcribes the given audio file into a sequence of words.

Parameters:
  • path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.

  • dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.

  • use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).

Returns:

The audio file transcription produced by this ASR system.

Return type:

str

make_streaming_context(dynchunktrain_config: DynChunkTrainConfig)[source]

Create a blank streaming context to be passed around for chunk encoding/transcription.

Parameters:

dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.

get_chunk_size_frames(dynchunktrain_config: DynChunkTrainConfig) → int[source]

Returns the chunk size in actual audio samples, i.e. the exact expected length along the time dimension of an input chunk tensor (as passed to encode_chunk() and similar low-level streaming functions).

Parameters:

dynchunktrain_config (DynChunkTrainConfig) – The streaming configuration to determine the chunk frame count of.

encode_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]

Encodes a batch of audio chunks into a batch of encoded sequences. For full offline speech-to-text transcription, use transcribe_batch or transcribe_file. Must be called over a given context in the correct order of chunks over time.

Parameters:
  • context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).

  • chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model’s expected format (i.e. the sampling rate must be correct).

  • chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).

Returns:

Encoded output, of a model-dependent shape.

Return type:

torch.Tensor

decode_chunk(context: ASRStreamingContext, x: Tensor) → tuple[list, list][source]

Decodes the output of the encoder into tokens and the associated transcription. Must be called over a given context in the correct order of chunks over time.

Parameters:
  • context (ASRStreamingContext) – Mutable streaming context object, which should be the same object that was passed to encode_chunk.

  • x (torch.Tensor) – The output of encode_chunk for a given chunk.

Returns:

  • list of str – Decoded tokens of length batch_size. The decoded strings can be of 0-length.

  • list of list of output token hypotheses – List of length batch_size, each holding a list of tokens of any length >=0.
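
A hedged sketch of the low-level encode/decode pair for a single chunk, built on make_streaming_context() and get_chunk_size_frames(); the silent chunk is only there to illustrate the expected shapes:

import torch
from speechbrain.inference.ASR import StreamingASR
from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig

asr_model = StreamingASR.from_hparams(
    source="speechbrain/asr-conformer-streaming-librispeech", savedir="pretrained_asr"
)
cfg = DynChunkTrainConfig(24, 8)              # illustrative chunk size and left context
context = asr_model.make_streaming_context(cfg)
chunk_frames = asr_model.get_chunk_size_frames(cfg)

chunk = torch.zeros(1, chunk_frames)          # [batch=1, time]; silence as a stand-in
encoded = asr_model.encode_chunk(context, chunk)
texts, tokens = asr_model.decode_chunk(context, encoded)
print(texts[0], tokens[0])                    # likely empty for a silent chunk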

transcribe_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]

Transcribes a batch of audio chunks into text. Must be called over a given context in the correct order of chunks over time.

Parameters:
  • context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).

  • chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model’s expected format (i.e. the sampling rate must be correct).

  • chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).

Returns:

Transcribed string for this chunk, might be of length zero.

Return type:

str
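
Putting the low-level pieces together, a hedged sketch of streaming over an in-memory waveform chunk by chunk with transcribe_chunk(); the audio path and chunk configuration are placeholders, and the final chunk is zero-padded to the required length with its true relative length passed in chunk_len:

import torch
from speechbrain.inference.ASR import StreamingASR
from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig

asr_model = StreamingASR.from_hparams(
    source="speechbrain/asr-conformer-streaming-librispeech", savedir="pretrained_asr"
)
cfg = DynChunkTrainConfig(24, 8)              # illustrative chunk size and left context
context = asr_model.make_streaming_context(cfg)
chunk_frames = asr_model.get_chunk_size_frames(cfg)

signal = asr_model.load_audio("test-en.wav")  # placeholder path, resampled for the model
transcript = ""
for start in range(0, signal.shape[0], chunk_frames):
    chunk = signal[start : start + chunk_frames]
    chunk_len = torch.tensor([chunk.shape[0] / chunk_frames])  # < 1.0 only on the last chunk
    if chunk.shape[0] < chunk_frames:                          # zero-pad the last chunk
        chunk = torch.nn.functional.pad(chunk, (0, chunk_frames - chunk.shape[0]))
    out = asr_model.transcribe_chunk(context, chunk.unsqueeze(0), chunk_len)
    # Hedge: handle either a plain string or a per-batch list of strings.
    transcript += out if isinstance(out, str) else out[0]
print(transcript)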