speechbrain.inference.ASR module
Specifies the inference interfaces for Automatic Speech Recognition (ASR) modules.
- Authors:
Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023, 2024
Adel Moumen 2023, 2024
Pradnya Kandarkar 2023
Summary
Classes:
- ASRStreamingContext: Streaming metadata, initialized by make_streaming_context()
- EncoderASR: A ready-to-use Encoder ASR model
- EncoderDecoderASR: A ready-to-use Encoder-Decoder ASR model
- StreamingASR: A ready-to-use, streaming-capable ASR model.
- WhisperASR: A ready-to-use Whisper ASR model
Reference
- class speechbrain.inference.ASR.EncoderDecoderASR(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use Encoder-Decoder ASR model
The class can be used either to run only the encoder (encode()) to extract features or to run the entire encoder-decoder model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.
Example
>>> from speechbrain.inference.ASR import EncoderDecoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderDecoderASR.from_hparams(
...     source="speechbrain/asr-crdnn-rnnlm-librispeech",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file("tests/samples/single-mic/example2.flac")
"MY FATHER HAS REVEALED THE CULPRIT'S NAME"
- HPARAMS_NEEDED = ['tokenizer']
- MODULES_NEEDED = ['encoder', 'decoder']
- encode_batch(wavs, wav_lens)[source]
Encodes the input audio into a sequence of hidden states
The waveforms should already be in the model’s desired format. You can call:
normalized = EncoderDecoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
The encoded batch
- Return type:
torch.Tensor
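For illustration, here is a minimal sketch of padding two waveforms of different lengths into a batch and encoding them; the wav paths are assumptions:

import torch
from speechbrain.inference.ASR import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
)
# Two hypothetical mono signals, already at the model's sampling rate.
sig1 = asr_model.load_audio("utt1.wav")
sig2 = asr_model.load_audio("utt2.wav")
max_len = max(sig1.size(0), sig2.size(0))
# Zero-pad to the longest signal; wav_lens holds the relative lengths.
wavs = torch.stack(
    [torch.nn.functional.pad(s, (0, max_len - s.size(0))) for s in (sig1, sig2)]
)
wav_lens = torch.tensor([sig1.size(0) / max_len, sig2.size(0) / max_len])
encoded = asr_model.encode_batch(wavs, wav_lens)  # the encoded batch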
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words
The waveforms should already be in the model’s desired format. You can call:
normalized = EncoderDecoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
list – Each waveform in the batch transcribed.
tensor – Each predicted token id.
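The same padded batch and relative lengths from the sketch above can be passed to transcribe_batch, which returns the transcriptions together with the predicted token ids:

predicted_words, predicted_tokens = asr_model.transcribe_batch(wavs, wav_lens)
print(predicted_words[0])  # transcription of the first waveform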
- class speechbrain.inference.ASR.EncoderASR(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use Encoder ASR model
The class can be used either to run only the encoder (encode()) to extract features, or to run the encoder followed by the decoding function (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.
Example
>>> from speechbrain.inference.ASR import EncoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderASR.from_hparams(
...     source="speechbrain/asr-wav2vec2-commonvoice-fr",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file("samples/audio_samples/example_fr.wav")
- HPARAMS_NEEDED = ['tokenizer', 'decoding_function']
- MODULES_NEEDED = ['encoder']
- set_decoding_function()[source]
Set the decoding function based on the parameters defined in the hyperparameter file.
The decoding function is determined by the decoding_function specified in the hyperparameter file. It can be either a functools.partial object representing a decoding function or an instance of speechbrain.decoders.ctc.CTCBaseSearcher for beam search decoding.
- Raises:
ValueError – If the decoding function is neither a functools.partial nor an instance of speechbrain.decoders.ctc.CTCBaseSearcher.
- Note:
For greedy decoding (functools.partial), the provided decoding_function is assigned directly. For CTCBeamSearcher decoding, an instance of the specified decoding_function is created, and additional parameters are added based on the tokenizer type.
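For illustration only, here is a minimal sketch of the two shapes a decoding_function may take if constructed in Python (in practice it is usually defined in the hyperparameter YAML); the constructor arguments shown below are assumptions:

import functools
from speechbrain.decoders import ctc_greedy_decode
from speechbrain.decoders.ctc import CTCBeamSearcher

# Greedy decoding: a functools.partial applied directly to the encoder
# posteriors and relative lengths.
greedy_decoding_function = functools.partial(ctc_greedy_decode, blank_id=-1)

# Beam search decoding: an instance of a CTCBaseSearcher subclass.
# The arguments below are illustrative assumptions only.
beam_decoding_function = CTCBeamSearcher(
    blank_index=0,
    vocab_list=["<blank>", "a", "b", "c"],  # hypothetical vocabulary
)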
- encode_batch(wavs, wav_lens)[source]
Encodes the input audio into a sequence of hidden states
The waveforms should already be in the model’s desired format. You can call:
normalized = EncoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
The encoded batch
- Return type:
torch.Tensor
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words
The waveforms should already be in the model’s desired format. You can call:
normalized = EncoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
list – Each waveform in the batch transcribed.
tensor – Each predicted token id.
- class speechbrain.inference.ASR.WhisperASR(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use Whisper ASR model
The class can be used to run the entire encoder-decoder Whisper model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.
Example
>>> from speechbrain.inference.ASR import WhisperASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = WhisperASR.from_hparams(
...     source="speechbrain/asr-whisper-medium-commonvoice-it",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file("speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav")
- HPARAMS_NEEDED = ['language']
- MODULES_NEEDED = ['whisper', 'decoder']
- encode_batch(wavs, wav_lens)[source]
Encodes the input audio into a sequence of hidden states
The waveforms should already be in the model’s desired format. You can call:
normalized = WhisperASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels].
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
The encoded batch
- Return type:
torch.Tensor
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words
The waveforms should already be in the model’s desired format. You can call:
normalized = WhisperASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels].
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
list – Each waveform in the batch transcribed.
tensor – Each predicted token id.
- class speechbrain.inference.ASR.ASRStreamingContext(config: DynChunkTrainConfig, fea_extractor_context: Any, encoder_context: Any, decoder_context: Any, tokenizer_context: List[Any] | None)[source]
Bases:
object
Streaming metadata, initialized by make_streaming_context() (see there for details on initialization of the fields here).
This object is intended to be mutated: the same object should be passed across calls as streaming progresses (namely when using the lower-level encode_chunk(), etc. APIs).
Holds some references to opaque streaming contexts, so the context is model-agnostic to an extent.
- config: DynChunkTrainConfig
Dynamic chunk training configuration used to initialize the streaming context. Cannot be modified on the fly.
- class speechbrain.inference.ASR.StreamingASR(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use, streaming-capable ASR model.
Example
>>> from speechbrain.inference.ASR import StreamingASR
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = StreamingASR.from_hparams(
...     source="speechbrain/asr-conformer-streaming-librispeech",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file(
...     "speechbrain/asr-conformer-streaming-librispeech/test-en.wav",
...     DynChunkTrainConfig(24, 8),
... )
- HPARAMS_NEEDED = ['fea_streaming_extractor', 'make_decoder_streaming_context', 'decoding_function', 'make_tokenizer_streaming_context', 'tokenizer_decode_streaming']
- MODULES_NEEDED = ['enc', 'proj_enc']
- transcribe_file_streaming(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True, **kwargs)[source]
Transcribes the given audio file into a sequence of words, in a streaming fashion, meaning that text is yielded from this generator in the form of strings to concatenate.
- Parameters:
path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.
dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.
use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).
- Returns:
An iterator yielding transcribed chunks (strings). There is a yield for every chunk, even if the transcribed string for that chunk is an empty string.
- Return type:
generator of str
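As a usage sketch (model source as in the example above; the local wav path is an assumption), partial transcriptions can be printed as they are yielded:

from speechbrain.inference.ASR import StreamingASR
from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig

asr_model = StreamingASR.from_hparams(
    source="speechbrain/asr-conformer-streaming-librispeech",
)
# Chunks are yielded as plain strings to concatenate; some may be empty.
for text_chunk in asr_model.transcribe_file_streaming(
    "test-en.wav",  # hypothetical local file
    DynChunkTrainConfig(24, 8),
):
    print(text_chunk, end="", flush=True)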
- transcribe_file(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True)[source]
Transcribes the given audio file into a sequence of words.
- Parameters:
path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SB fetching to allow fetching from HF or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.
dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.
use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).
- Returns:
The audio file transcription produced by this ASR system.
- Return type:
str
- make_streaming_context(dynchunktrain_config: DynChunkTrainConfig)[source]
Create a blank streaming context to be passed around for chunk encoding/transcription.
- Parameters:
dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent is model-dependent.
- get_chunk_size_frames(dynchunktrain_config: DynChunkTrainConfig) int [source]
Returns the chunk size in actual audio samples, i.e. the exact expected length along the time dimension of an input chunk tensor (as passed to encode_chunk() and similar low-level streaming functions).
- Parameters:
dynchunktrain_config (DynChunkTrainConfig) – The streaming configuration to determine the chunk frame count of.
- encode_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]
Encoding of a batch of audio chunks into a batch of encoded sequences. For full speech-to-text offline transcription, use transcribe_batch or transcribe_file. Must be called over a given context in the correct order of chunks over time.
- Parameters:
context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).
chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model's expected format (i.e. the sampling rate must be correct).
chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).
- Returns:
Encoded output, of a model-dependent shape.
- Return type:
torch.Tensor
- decode_chunk(context: ASRStreamingContext, x: Tensor) tuple[list, list] [source]
Decodes the output of the encoder into tokens and the associated transcription. Must be called over a given context in the correct order of chunks over time.
- Parameters:
context (ASRStreamingContext) – Mutable streaming context object, which should be the same object that was passed to encode_chunk.
x (torch.Tensor) – The output of encode_chunk for a given chunk.
- Returns:
list of str – Decoded tokens of length batch_size. The decoded strings can be of 0-length.
list of list of output token hypotheses – List of length batch_size, each holding a list of tokens of any length >= 0.
- transcribe_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]
Transcription of a batch of audio chunks into transcribed text. Must be called over a given context in the correct order of chunks over time.
- Parameters:
context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).
chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model's expected format (i.e. the sampling rate must be correct).
chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).
- Returns:
Transcribed string for this chunk, might be of length zero.
- Return type:
str
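Putting the low-level streaming pieces together, here is a minimal sketch of chunked transcription over an in-memory waveform using make_streaming_context(), encode_chunk(), and decode_chunk(); the file path and config values are assumptions, and load_audio is assumed to return a mono 1-D tensor at the model's sampling rate:

import torch
from speechbrain.inference.ASR import StreamingASR
from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig

asr_model = StreamingASR.from_hparams(
    source="speechbrain/asr-conformer-streaming-librispeech",
)
config = DynChunkTrainConfig(24, 8)  # values from the example above
context = asr_model.make_streaming_context(config)
chunk_size = asr_model.get_chunk_size_frames(config)

waveform = asr_model.load_audio("test-en.wav")  # hypothetical local file
for offset in range(0, waveform.size(0), chunk_size):
    piece = waveform[offset : offset + chunk_size].unsqueeze(0)  # [1, time]
    # Relative length is computed before padding so padding is ignored.
    piece_len = torch.tensor([piece.size(1) / chunk_size])
    # Zero-pad the final chunk to the exact expected frame count.
    piece = torch.nn.functional.pad(piece, (0, chunk_size - piece.size(1)))
    encoded = asr_model.encode_chunk(context, piece, piece_len)
    words, _tokens = asr_model.decode_chunk(context, encoded)
    print(words[0], end="")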