speechbrain.inference.ST module

Specifies the inference interfaces for Speech Translation (ST) modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

EncoderDecoderS2UT

A ready-to-use encoder-decoder model for speech-to-unit translation

Reference

class speechbrain.inference.ST.EncoderDecoderS2UT(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use encoder-decoder model for speech-to-unit translation

The class can be used to run the entire encoder-decoder S2UT model (translate_file()) to translate speech. The given YAML must contain the fields specified in the *_NEEDED[] lists.

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

Example

>>> from speechbrain.inference.ST import EncoderDecoderS2UT
>>> tmpdir = getfixture("tmpdir")
>>> s2ut_model = EncoderDecoderS2UT.from_hparams(source="speechbrain/s2st-transformer-fr-en-hubert-l6-k100-cvss", savedir=tmpdir) 
>>> s2ut_model.translate_file("speechbrain/s2st-transformer-fr-en-hubert-l6-k100-cvss/example-fr.wav") 
HPARAMS_NEEDED = ['sample_rate']

MODULES_NEEDED = ['encoder', 'decoder']

translate_file(path)[source]

Translates the given audio file into a sequence of speech units.

Parameters:

path (str) – Path to the audio file to translate.

Returns:

The audio file's translation produced by this speech-to-unit translation model.

Return type:

int[]

encode_batch(wavs, wav_lens)[source]

Encodes the input audio into a sequence of hidden states

The waveforms should already be in the model’s desired format. In most cases, you can call normalized = EncoderDecoderS2UT.normalizer(signal, sample_rate) to get a correctly converted signal.

Parameters:
  • wavs (torch.tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

The encoded batch

Return type:

torch.tensor
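
A minimal sketch (not part of the original docs) of batching a single utterance and encoding it, reusing the pretrained model and example file from the class Example above; load_audio is the audio loader inherited from Pretrained:

>>> import torch
>>> from speechbrain.inference.ST import EncoderDecoderS2UT
>>> tmpdir = getfixture("tmpdir")
>>> s2ut_model = EncoderDecoderS2UT.from_hparams(source="speechbrain/s2st-transformer-fr-en-hubert-l6-k100-cvss", savedir=tmpdir)
>>> signal = s2ut_model.load_audio("speechbrain/s2st-transformer-fr-en-hubert-l6-k100-cvss/example-fr.wav")
>>> wavs = signal.unsqueeze(0)      # fake a batch of size 1
>>> wav_lens = torch.tensor([1.0])  # the single waveform has relative length 1.0
>>> encoder_out = s2ut_model.encode_batch(wavs, wav_lens)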

translate_batch(wavs, wav_lens)[source]

Translates the input audio into a sequence of speech units.

The waveforms should already be in the model’s desired format. In most cases, you can call normalized = EncoderDecoderS2UT.normalizer(signal, sample_rate) to get a correctly converted signal.

Parameters:
  • wavs (torch.tensor) – Batch of waveforms [batch, time, channels].

  • wav_lens (torch.tensor) – Lengths of the waveforms relative to the longest one in the batch, tensor of shape [batch]. The longest one should have relative length 1.0 and others len(waveform) / max_length. Used for ignoring padding.

Returns:

  • list – The translation of each waveform in the batch.

  • tensor – The predicted token ids for each waveform.
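
Continuing the encode_batch() sketch above (an illustrative, hedged example rather than part of the original docs), the same batched waveform and relative lengths can be passed directly to translate_batch() to obtain the predicted speech units:

>>> predictions = s2ut_model.translate_batch(wavs, wav_lens)
>>> units = predictions[0]  # predicted unit token ids for the first (and only) utterance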

forward(wavs, wav_lens)[source]

Runs the full translation.
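
Assuming forward() wraps the same pipeline as translate_batch() (an assumption based on the one-line docstring above, not verified against the implementation), the pretrained model can also be called directly on a batch:

>>> units = s2ut_model(wavs, wav_lens)  # assumed to behave like translate_batch(wavs, wav_lens)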