speechbrain.inference.ASR module
Specifies the inference interfaces for Automatic Speech Recognition (ASR) modules.
- Authors:
Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023, 2024
Adel Moumen 2023, 2024, 2025
Pradnya Kandarkar 2023
Summary
Classes:
| ASRStreamingContext | Streaming metadata, initialized by make_streaming_context(). |
| ASRWhisperSegment | A single chunk of audio for Whisper ASR streaming. |
| EncoderASR | A ready-to-use Encoder ASR model |
| EncoderDecoderASR | A ready-to-use Encoder-Decoder ASR model |
| SpeechLLMASR | A ready-to-use SpeechLLM ASR model interface. |
| StreamingASR | A ready-to-use, streaming-capable ASR model. |
| WhisperASR | A ready-to-use Whisper ASR model. |
Reference
- class speechbrain.inference.ASR.EncoderDecoderASR(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use Encoder-Decoder ASR model
The class can be used either to run only the encoder (encode()) to extract features or to run the entire encoder-decoder model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.
Example
>>> from speechbrain.inference.ASR import EncoderDecoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderDecoderASR.from_hparams(
...     source="speechbrain/asr-crdnn-rnnlm-librispeech",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file(
...     "tests/samples/single-mic/example2.flac"
... )
"MY FATHER HAS REVEALED THE CULPRIT'S NAME"
- HPARAMS_NEEDED = ['tokenizer']
- MODULES_NEEDED = ['encoder', 'decoder']
- transcribe_file(path, **kwargs)[source]
Transcribes the given audiofile into a sequence of words.
- encode_batch(wavs, wav_lens)[source]
Encodes the input audio into a sequence of hidden states.
The waveforms should already be in the model's desired format. You can call:
normalized = EncoderDecoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
The encoded batch
- Return type:
torch.Tensor
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words.
The waveforms should already be in the model's desired format. You can call:
normalized = EncoderDecoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
list – Each waveform in the batch transcribed.
tensor – Each predicted token id.
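The relative-length convention for wav_lens can be illustrated with a small sketch. It uses plain Python lists instead of torch.Tensor so that it runs without any dependency; pad_batch is a hypothetical helper written for this illustration and is not part of SpeechBrain.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    waveforms: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[float]]:
    """Right-pad waveforms with zeros and compute their relative lengths.

    Hypothetical helper (not a SpeechBrain function).
    """
    if not waveforms:
        raise ValueError("expected at least one waveform")
    max_length = max(len(w) for w in waveforms)
    if max_length == 0:
        raise ValueError("all waveforms are empty")
    # Every waveform is padded to the length of the longest one.
    padded = [list(w) + [0.0] * (max_length - len(w)) for w in waveforms]
    # The longest waveform gets 1.0, the others len(waveform) / max_length.
    wav_lens = [len(w) / max_length for w in waveforms]
    return padded, wav_lens


padded, wav_lens = pad_batch([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]])
print(wav_lens)  # [1.0, 0.5]
```

In real use, both lists would be converted to tensors before being passed to encode_batch or transcribe_batch.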
- class speechbrain.inference.ASR.EncoderASR(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use Encoder ASR model
The class can be used either to run only the encoder (encode()) to extract features or to run the entire encoder + decoder function model (transcribe()) to transcribe speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.
Example
>>> from speechbrain.inference.ASR import EncoderASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = EncoderASR.from_hparams(
...     source="speechbrain/asr-wav2vec2-commonvoice-fr",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file(
...     "samples/audio_samples/example_fr.wav"
... )
- HPARAMS_NEEDED = ['tokenizer', 'decoding_function']
- MODULES_NEEDED = ['encoder']
- set_decoding_function()[source]
Set the decoding function based on the parameters defined in the hyperparameter file.
The decoding function is determined by the decoding_function specified in the hyperparameter file. It can be either a functools.partial object representing a decoding function or an instance of speechbrain.decoders.ctc.CTCBaseSearcher for beam search decoding.
- Raises:
ValueError – If the decoding function is neither a functools.partial nor an instance of speechbrain.decoders.ctc.CTCBaseSearcher.
- Note:
For greedy decoding (functools.partial), the provided decoding_function is assigned directly.
For CTCBeamSearcher decoding, an instance of the specified decoding_function is created, and additional parameters are added based on the tokenizer type.
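The dispatch described above can be sketched roughly as follows. This is an illustration under assumptions, not the library's implementation: BaseSearcherStandIn is a hypothetical stand-in for speechbrain.decoders.ctc.CTCBaseSearcher, greedy_decode is a toy decoder, and the sketch assumes that in the beam search case the hyperparameter holds a searcher class that still has to be instantiated.

```python
import functools


class BaseSearcherStandIn:
    """Hypothetical stand-in for speechbrain.decoders.ctc.CTCBaseSearcher."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


def greedy_decode(scores, blank_id=0):
    """Toy greedy CTC decoding: argmax, collapse repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in scores]
    collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
    return [t for t in collapsed if t != blank_id]


def resolve_decoding_function(decoding_function, **searcher_kwargs):
    """Return a ready-to-call decoder, or raise ValueError."""
    # Greedy case: the partial is used directly.
    if isinstance(decoding_function, functools.partial):
        return decoding_function
    # Beam search case (assumption): a searcher class is instantiated,
    # with extra keyword arguments such as tokenizer-derived settings.
    if isinstance(decoding_function, type) and issubclass(
        decoding_function, BaseSearcherStandIn
    ):
        return decoding_function(**searcher_kwargs)
    raise ValueError(
        "decoding_function must be a functools.partial or a searcher class"
    )


greedy = functools.partial(greedy_decode, blank_id=0)
resolved_greedy = resolve_decoding_function(greedy)
resolved_searcher = resolve_decoding_function(BaseSearcherStandIn, blank_index=0)
```

Anything else, for example a plain string, falls through to the ValueError branch.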
- transcribe_file(path, **kwargs)[source]
Transcribes the given audiofile into a sequence of words.
- encode_batch(wavs, wav_lens)[source]
Encodes the input audio into a sequence of hidden states.
The waveforms should already be in the model's desired format. You can call:
normalized = EncoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
The encoded batch
- Return type:
torch.Tensor
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words.
The waveforms should already be in the model's desired format. You can call:
normalized = EncoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.Tensor) – Batch of waveforms [batch, time, channels] or [batch, time] depending on the model.
wav_lens (torch.Tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
list – Each waveform in the batch transcribed.
tensor – Each predicted token id.
- class speechbrain.inference.ASR.ASRWhisperSegment(start: float, end: float, chunk: Tensor, lang_id: str | None = None, words: str | None = None, tokens: List[str] | None = None, prompt: List[str] | None = None, avg_log_probs: float | None = None, no_speech_prob: float | None = None)[source]
Bases: object
A single chunk of audio for Whisper ASR streaming.
This object is intended to be mutated as streaming progresses and passed across calls to the lower-level APIs such as encode_chunk, decode_chunk, etc.
- chunk
The audio chunk, shape [time, channels].
- Type:
torch.Tensor
- class speechbrain.inference.ASR.WhisperASR(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use Whisper ASR model.
The class can be used to run the entire encoder-decoder Whisper model. The supported tasks are: transcribe, translate, and lang_id. The given YAML must contain the fields specified in the *_NEEDED[] lists.
Example
>>> from speechbrain.inference.ASR import WhisperASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = WhisperASR.from_hparams(
...     source="speechbrain/asr-whisper-medium-commonvoice-it",
...     savedir=tmpdir,
... )
>>> hyp = asr_model.transcribe_file(
...     "speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav"
... )
>>> hyp
buongiorno a tutti e benvenuti a bordo
>>> _, probs = asr_model.detect_language_file(
...     "speechbrain/asr-whisper-medium-commonvoice-it/example-it.wav"
... )
>>> print(
...     f"Detected language: {max(probs[0], key=probs[0].get)}"
... )
Detected language: it
- HPARAMS_NEEDED = ['language', 'sample_rate']
- MODULES_NEEDED = ['whisper', 'decoder']
- TASKS = ['transcribe', 'translate', 'lang_id']
- detect_language_file(path: str)[source]
Detects the language of the given audio file. This method only works on input files of 30 seconds or less.
- Parameters:
path (str) – Path to the audio file whose language should be detected.
- Returns:
language_tokens (torch.Tensor) – The detected language tokens.
language_probs (dict) – The probabilities of the detected language tokens.
- Raises:
ValueError – If the model doesn't have language tokens.
- detect_language_batch(wav: Tensor)[source]
Detects the language of the given wav Tensor. This method only works on waveforms of 30 seconds or less.
- Parameters:
wav (torch.tensor) – Batch of waveforms [batch, time, channels].
- Returns:
language_tokens (torch.Tensor of shape (batch_size,)) – IDs of the most probable language tokens, which appear after the startoftranscript token.
language_probs (List[Dict[str, float]]) – List of dictionaries containing the probability distribution over all languages.
- Raises:
ValueError – If the model doesn't have language tokens.
Example
>>> from speechbrain.inference.ASR import WhisperASR
>>> from speechbrain.dataio import audio_io
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = WhisperASR.from_hparams(
...     source="speechbrain/asr-whisper-medium-commonvoice-it",
...     savedir=tmpdir,
... )
>>> wav, _ = audio_io.load("your_audio")
>>> language_tokens, language_probs = asr_model.detect_language_batch(
...     wav
... )
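Since language_probs is documented as a list with one dictionary per batch item, the most likely language of each item can be read off with a short helper. The sketch below is illustrative only: top_languages is a hypothetical helper and the probability values are made up.

```python
from typing import Dict, List


def top_languages(language_probs: List[Dict[str, float]]) -> List[str]:
    """Return the highest-probability language key for each batch item."""
    return [max(probs, key=probs.get) for probs in language_probs]


# Made-up probabilities standing in for the output of detect_language_batch.
example_probs = [
    {"it": 0.91, "en": 0.06, "fr": 0.03},
    {"it": 0.10, "en": 0.85, "fr": 0.05},
]
print(top_languages(example_probs))  # ['it', 'en']
```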
- transcribe_file_streaming(path: str, task: str | None = None, initial_prompt: str | None = None, logprob_threshold: float | None = -1.0, no_speech_threshold=0.6, condition_on_previous_text: bool = False, verbose: bool = False, use_torchaudio_streaming: bool = False, chunk_size: int = 30, **kwargs)[source]
Transcribes the given audio file into a sequence of words. This method supports the following tasks: transcribe, translate, and lang_id. It can process an input audio file longer than 30 seconds by splitting it into chunk_size-second segments.
- Parameters:
path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SpeechBrain fetching to allow fetching from Hugging Face or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.
task (Optional[str]) – The task to perform. If None, the default task is the one passed in the Whisper model.
initial_prompt (Optional[str]) – The initial prompt to condition the model on.
logprob_threshold (Optional[float]) – The log probability threshold to continue decoding the current segment.
no_speech_threshold (float) – The threshold above which a segment is skipped: decoding is skipped if the segment's no_speech_prob is higher than this value.
condition_on_previous_text (bool) – If True, the model will be conditioned on the last 224 tokens.
verbose (bool) – If True, print the transcription of each segment.
use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).
chunk_size (int) – The size, in seconds, of the chunks to split the audio into. The default chunk size is 30 seconds, which corresponds to the maximal length that the model can process in one go.
**kwargs (dict) – Arguments forwarded to load_audio
- Yields:
ASRWhisperSegment – A new ASRWhisperSegment instance initialized with the provided parameters.
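To make the chunk_size splitting concrete, the following sketch computes the (start, end) boundaries, in seconds, of consecutive chunks for a file of a given duration. It only shows the arithmetic: chunk_boundaries is a hypothetical helper, and the library may handle the final chunk differently (for example by padding it).

```python
from typing import List, Tuple


def chunk_boundaries(
    duration_s: float, chunk_size_s: int = 30
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for consecutive chunks."""
    if chunk_size_s <= 0:
        raise ValueError("chunk_size_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        # The last chunk is shorter when the duration is not a multiple
        # of the chunk size.
        end = min(start + chunk_size_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


print(chunk_boundaries(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```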
- transcribe_file(path: str, task: str | None = None, initial_prompt: str | None = None, logprob_threshold: float | None = -1.0, no_speech_threshold=0.6, condition_on_previous_text: bool = False, verbose: bool = False, use_torchaudio_streaming: bool = False, chunk_size: int | None = 30, **kwargs) List[ASRWhisperSegment][source]
Run the Whisper model using the specified task on the given audio file and return the ASRWhisperSegment objects for each segment.
This method supports the following tasks: transcribe, translate, and lang_id. It can process an input audio file longer than 30 seconds by splitting it into chunk_size-second segments.
- Parameters:
path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SpeechBrain fetching to allow fetching from Hugging Face or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.
task (Optional[str]) – The task to perform. If None, the default task is the one passed in the Whisper model. It can be one of the following: transcribe, translate, lang_id.
initial_prompt (Optional[str]) – The initial prompt to condition the model on.
logprob_threshold (Optional[float]) – The log probability threshold to continue decoding the current segment.
no_speech_threshold (float) – The threshold above which a segment is skipped: decoding is skipped if the segment's no_speech_prob is higher than this value.
condition_on_previous_text (bool) – If True, the model will be conditioned on the last 224 tokens.
verbose (bool) – If True, print the details of each segment.
use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).
chunk_size (Optional[int]) – The size, in seconds, of the chunks to split the audio into. The default chunk size is 30 seconds, which corresponds to the maximal length that the model can process in one go.
**kwargs (dict) – Arguments forwarded to load_audio
- Returns:
results – A list of ASRWhisperSegment objects, each containing the task result.
- Return type:
List[ASRWhisperSegment]
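A returned list of segments can be collapsed into one transcript by joining each segment's words field. The sketch below uses SegmentStandIn, a hypothetical dataclass that mirrors only a few fields of the ASRWhisperSegment signature (start, end, words), so that it runs without loading a model; the sample text and timings are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SegmentStandIn:
    """Hypothetical stand-in mirroring a subset of ASRWhisperSegment."""

    start: float
    end: float
    words: Optional[str] = None


def join_words(segments: List[SegmentStandIn]) -> str:
    """Join the text of all segments, skipping segments without words."""
    pieces = [seg.words.strip() for seg in segments if seg.words]
    return " ".join(pieces)


segments = [
    SegmentStandIn(0.0, 30.0, " buongiorno a tutti"),
    SegmentStandIn(30.0, 60.0, None),  # e.g. a segment with no text
    SegmentStandIn(60.0, 72.5, "e benvenuti a bordo "),
]
transcript = join_words(segments)
print(transcript)  # buongiorno a tutti e benvenuti a bordo
```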
- encode_batch(wavs, wav_lens)[source]
Encodes the input audio into a sequence of hidden states.
The waveforms should already be in the model's desired format. You can call:
normalized = EncoderDecoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.tensor) – Batch of waveforms [batch, time, channels].
wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
The encoded batch
- Return type:
torch.tensor
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words.
The waveforms should already be in the model's desired format. You can call:
normalized = EncoderDecoderASR.normalizer(signal, sample_rate)
to get a correctly converted signal in most cases.
- Parameters:
wavs (torch.tensor) – Batch of waveforms [batch, time, channels].
wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.
- Returns:
list – Each waveform in the batch transcribed.
tensor – Each predicted token id.
- class speechbrain.inference.ASR.ASRStreamingContext(config: DynChunkTrainConfig, fea_extractor_context: Any, encoder_context: Any, decoder_context: Any, tokenizer_context: List[Any] | None)[source]
Bases: object
Streaming metadata, initialized by make_streaming_context() (see there for details on initialization of fields here).
This object is intended to be mutated: the same object should be passed across calls as streaming progresses (namely when using the lower-level encode_chunk(), etc. APIs).
Holds some references to opaque streaming contexts, so the context is model-agnostic to an extent.
- config: DynChunkTrainConfig
Dynamic chunk training configuration used to initialize the streaming context. Cannot be modified on the fly.
- class speechbrain.inference.ASR.StreamingASR(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use, streaming-capable ASR model.
Example
>>> from speechbrain.inference.ASR import StreamingASR
>>> from speechbrain.utils.dynamic_chunk_training import DynChunkTrainConfig
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = StreamingASR.from_hparams(
...     source="speechbrain/asr-conformer-streaming-librispeech",
...     savedir=tmpdir,
... )
>>> asr_model.transcribe_file(
...     "speechbrain/asr-conformer-streaming-librispeech/test-en.wav",
...     DynChunkTrainConfig(24, 8),
... )
- HPARAMS_NEEDED = ['fea_streaming_extractor', 'make_decoder_streaming_context', 'decoding_function', 'make_tokenizer_streaming_context', 'tokenizer_decode_streaming']
- MODULES_NEEDED = ['enc', 'proj_enc']
- transcribe_file_streaming(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True, **kwargs)[source]
Transcribes the given audio file into a sequence of words, in a streaming fashion: this generator yields the text as strings to concatenate.
- Parameters:
path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SpeechBrain fetching to allow fetching from Hugging Face or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.
dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent are model-dependent.
use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).
**kwargs (dict) – Arguments forwarded to load_audio
- Yields:
generator of str – An iterator yielding transcribed chunks (strings). There is a yield for every chunk, even if the transcribed string for that chunk is an empty string.
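Because the generator yields one string per chunk, including empty ones, a full transcript is obtained by concatenating the yielded strings. The sketch below shows that consumption pattern with fake_streaming_transcription, a hypothetical stand-in generator with made-up text, so that it runs without a model.

```python
from typing import Iterable, Iterator


def fake_streaming_transcription() -> Iterator[str]:
    """Hypothetical stand-in for a streaming transcription generator."""
    # One string per chunk; some chunks produce no text at all.
    yield from ["THE", "", " BIRCH", " CANOE", ""]


def collect_transcript(chunks: Iterable[str]) -> str:
    """Concatenate per-chunk strings into the full transcript."""
    return "".join(chunks)


full_text = collect_transcript(fake_streaming_transcription())
print(full_text)  # THE BIRCH CANOE
```

An application that displays partial results would instead print each yielded string as it arrives, rather than collecting them first.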
- transcribe_file(path, dynchunktrain_config: DynChunkTrainConfig, use_torchaudio_streaming: bool = True)[source]
Transcribes the given audio file into a sequence of words.
- Parameters:
path (str) – URI/path to the audio to transcribe. When use_torchaudio_streaming is False, uses SpeechBrain fetching to allow fetching from Hugging Face or a local file. When True, resolves the URI through ffmpeg, as documented in torchaudio.io.StreamReader.
dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent are model-dependent.
use_torchaudio_streaming (bool) – Whether the audio file can be loaded in a streaming fashion. If not, transcription is still performed through chunks of audio, but the entire audio file is fetched and loaded at once. This skips the usual fetching method and instead resolves the URI using torchaudio (via ffmpeg).
- Returns:
The audio file transcription produced by this ASR system.
- Return type:
str
- make_streaming_context(dynchunktrain_config: DynChunkTrainConfig)[source]
Create a blank streaming context to be passed around for chunk encoding/transcription.
- Parameters:
dynchunktrain_config (DynChunkTrainConfig) – Streaming configuration. Sane values and how much time chunks actually represent are model-dependent.
- Return type:
ASRStreamingContext
- get_chunk_size_frames(dynchunktrain_config: DynChunkTrainConfig) int[source]
Returns the chunk size in actual audio samples, i.e. the exact expected length along the time dimension of an input chunk tensor (as passed to encode_chunk() and similar low-level streaming functions).
- Parameters:
dynchunktrain_config (DynChunkTrainConfig) – The streaming configuration to determine the chunk frame count of.
- Returns:
The chunk size, in audio samples.
- Return type:
int
- encode_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]
Encoding of a batch of audio chunks into a batch of encoded sequences. For full speech-to-text offline transcription, use transcribe_batch or transcribe_file. Must be called over a given context in the correct order of chunks over time.
- Parameters:
context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).
chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model's expected format (i.e. the sampling rate must be correct).
chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).
- Returns:
Encoded output, of a model-dependent shape.
- Return type:
torch.Tensor
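The relationship between a fixed chunk size and the relative chunk_len can be sketched for a single utterance as follows. split_into_chunks is a hypothetical helper operating on plain lists. In real use the chunk size would come from asr.get_chunk_size_frames(config), and chunks would be batched tensors.

```python
from typing import List, Sequence, Tuple


def split_into_chunks(
    samples: Sequence[float], chunk_size: int
) -> List[Tuple[List[float], float]]:
    """Split samples into fixed-size chunks with their relative lengths.

    The final chunk is zero-padded to chunk_size, and its relative length
    records how much of it holds real audio.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(samples), chunk_size):
        piece = list(samples[start:start + chunk_size])
        rel_len = len(piece) / chunk_size
        piece += [0.0] * (chunk_size - len(piece))
        chunks.append((piece, rel_len))
    return chunks


chunks = split_into_chunks([1.0] * 10, chunk_size=4)
print([rel_len for _, rel_len in chunks])  # [1.0, 1.0, 0.5]
```

Each (chunk, relative length) pair corresponds to one call to encode_chunk or transcribe_chunk, made in order with the same context object.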
- decode_chunk(context: ASRStreamingContext, x: Tensor) Tuple[List[str], List[List[int]]][source]
Decodes the output of the encoder into tokens and the associated transcription. Must be called over a given context in the correct order of chunks over time.
- Parameters:
context (ASRStreamingContext) – Mutable streaming context object, which should be the same object that was passed to encode_chunk.
x (torch.Tensor) – The output of encode_chunk for a given chunk.
- Returns:
list of str – Decoded tokens of length batch_size. The decoded strings can be of 0-length.
list of list of output token hypotheses – List of length batch_size, each holding a list of tokens of any length >=0.
- transcribe_chunk(context: ASRStreamingContext, chunk: Tensor, chunk_len: Tensor | None = None)[source]
Transcription of a batch of audio chunks into transcribed text. Must be called over a given context in the correct order of chunks over time.
- Parameters:
context (ASRStreamingContext) – Mutable streaming context object, which must be specified and reused across calls when streaming. You can obtain an initial context by calling asr.make_streaming_context(config).
chunk (torch.Tensor) – The tensor for an audio chunk of shape [batch size, time]. The time dimension must strictly match asr.get_chunk_size_frames(config). The waveform is expected to be in the model's expected format (i.e. the sampling rate must be correct).
chunk_len (torch.Tensor, optional) – The relative chunk length tensor of shape [batch size]. This is to be used when the audio in one of the chunks of the batch is ending within this chunk. If unspecified, equivalent to torch.ones((batch_size,)).
- Returns:
Transcribed string for this chunk, might be of length zero.
- Return type:
- class speechbrain.inference.ASR.SpeechLLMASR(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use SpeechLLM ASR model interface.
The class can be used to run the entire SpeechLLM model. First, the audio is encoded into a sequence of hidden states using the speech_encoder. Then, the hidden states are downsampled using the feat_downsampler and projected using the proj module. The projected features are concatenated with the text embeddings and passed to the searcher module. The searcher module returns the predicted tokens and the predicted words using an LLM decoder.
The given YAML must contain the fields specified in the HPARAMS_NEEDED list.
Example
>>> from speechbrain.inference.ASR import SpeechLLMASR
>>> tmpdir = getfixture("tmpdir")
>>> asr_model = SpeechLLMASR.from_hparams(
...     source="speechbrain/asr-speechllm-librispeech",
...     savedir=tmpdir,
... )
>>> hyp = asr_model.transcribe_file(
...     "speechbrain/asr-speechllm-librispeech/example-en.wav"
... )
>>> hyp
THE BIRCH CANOE SLID ON THE SMOOTH PLANKS
- HPARAMS_NEEDED = ['bos_index', 'eos_index', 'prompt']
- MODULES_NEEDED = ['speech_encoder', 'feat_downsampler', 'proj', 'llm', 'normalize', 'searcher']
- build_multimodal_embds(audio_feats)[source]
Builds the multimodal embeddings for the audio features.
- encode_batch(wavs, wav_lens)[source]
Encodes the audio waveforms into a sequence of hidden states. By default, the self.inference_ctx is used to run the forward pass. Can be overridden by passing a custom --precision argument.
- Parameters:
wavs (torch.Tensor) – The audio waveforms of shape (batch_size, time).
wav_lens (torch.Tensor) – The lengths of the audio waveforms of shape (batch_size,).
- Returns:
audio_feats – The encoded audio features of shape (batch_size, time, feat_dim).
- Return type:
torch.Tensor
- transcribe_batch(wavs, wav_lens)[source]
Transcribes the input audio into a sequence of words.
- Parameters:
wavs (torch.Tensor) – The audio waveforms of shape (batch_size, time).
wav_lens (torch.Tensor) – The lengths of the audio waveforms of shape (batch_size,).
- Returns:
predicted_words (list) – The predicted words of shape (batch_size,).
predicted_tokens (list) – The predicted tokens of shape (batch_size,).