speechbrain.inference.vocoders module

Specifies the inference interfaces for vocoders used in Text-To-Speech (TTS) pipelines.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

DiffWaveVocoder

A ready-to-use inference wrapper for DiffWave as vocoder. The wrapper supports locally-conditioned generation: mel_spec -> waveform.

HIFIGAN

A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).

UnitHIFIGAN

A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).

Reference

class speechbrain.inference.vocoders.HIFIGAN(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

Example

>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> mel_specs = torch.rand(2, 80, 298)
>>> waveforms = hifi_gan.decode_batch(mel_specs)
>>> # You can use the vocoder coupled with a TTS system
>>> # Initialize TTS (tacotron2)
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> from speechbrain.inference.TTS import Tacotron2
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['generator']
decode_batch(spectrogram, mel_lens=None, hop_len=None)[source]

Computes waveforms from a batch of mel-spectrograms

Parameters:
  • spectrogram (torch.Tensor) – Batch of mel-spectrograms [batch, mels, time]

  • mel_lens (torch.Tensor) – A list of lengths of the mel-spectrograms in the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; should be the same value as in the .yaml file

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
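As a rough sanity check on decode_batch output shapes: HiFiGAN upsamples each mel frame by the hop length, so the output waveform length is approximately the number of mel frames times hop_len. A minimal sketch (the hop_len=256 default below is an assumption taken from common LJSpeech recipes, not read from any particular .yaml):

```python
# Hedged sketch: estimate the waveform length produced by the vocoder
# for a mel input. hop_len=256 is an assumed value; use the one from
# your model's .yaml file.

def expected_num_samples(num_mel_frames, hop_len=256):
    """Approximate output waveform length for a mel-spectrogram input."""
    return num_mel_frames * hop_len

# The 298-frame mel from the class example maps to ~76288 samples.
n_samples = expected_num_samples(298)
```

The exact length can differ slightly depending on the generator's internal padding, so treat this only as an order-of-magnitude check.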

mask_noise(waveform, mel_lens, hop_len)[source]

Mask the noise caused by padding during batch inference

Parameters:
  • waveform (torch.Tensor) – Batch of generated waveforms [batch, 1, time]

  • mel_lens (torch.Tensor) – A list of lengths of the mel-spectrograms in the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

Returns:

waveform – Batch of waveforms without padded noise [batch, 1, time]

Return type:

torch.tensor
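The idea behind mask_noise can be sketched without SpeechBrain at all: any sample past mel_len * hop_len was generated from padded (all-zero) mel frames and is zeroed out. The function below is an illustrative reimplementation over plain Python lists, not the actual mask_noise code:

```python
# Hedged sketch of the padding-mask idea: zero every sample generated
# from padded mel frames. The function name and list-based tensors are
# illustrative, not SpeechBrain's implementation.

def mask_padded_noise(waveforms, mel_lens, hop_len):
    """Zero out samples generated from padded mel frames.

    waveforms: list of per-item sample lists [batch][time]
    mel_lens:  true mel-spectrogram length of each item
    hop_len:   hop length used during mel extraction
    """
    masked = []
    for wav, mel_len in zip(waveforms, mel_lens):
        cutoff = mel_len * hop_len  # true waveform length in samples
        masked.append(wav[:cutoff] + [0.0] * (len(wav) - cutoff))
    return masked

# Two items, hop_len=4; the second item has only 2 real mel frames.
out = mask_padded_noise([[1.0] * 16, [1.0] * 16], mel_lens=[4, 2], hop_len=4)
```

Here the first item is untouched (4 frames * hop 4 = 16 samples, all real), while the second item keeps 8 samples and has its padded tail zeroed.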

decode_spectrogram(spectrogram)[source]

Computes waveforms from a single mel-spectrogram

Parameters:

spectrogram (torch.Tensor) – mel-spectrogram [mels, time]

Returns:

waveform (torch.Tensor) – waveform [1, time]

The audio can be saved by:

>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)

forward(spectrogram)[source]

Decodes the input spectrograms

class speechbrain.inference.vocoders.DiffWaveVocoder(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use inference wrapper for DiffWave as vocoder. The wrapper supports locally-conditioned generation:

mel_spec -> waveform

Parameters:
  • *args (tuple)

  • **kwargs (dict) – Arguments are forwarded to Pretrained parent class.

HPARAMS_NEEDED = ['diffusion']
decode_batch(mel, hop_len, mel_lens=None, fast_sampling=False, fast_sampling_noise_schedule=None)[source]

Generates waveforms from a batch of mel-spectrograms

Parameters:
  • mel (torch.Tensor) – Batch of mel-spectrograms [batch, mels, time]

  • hop_len (int) – Hop length used during mel-spectrogram extraction; should be the same value as in the .yaml file. Used to determine the output waveform length and to mask the noise caused by padding

  • mel_lens (torch.Tensor) – A list of lengths of the mel-spectrograms in the batch; can be obtained from the output of Tacotron/FastSpeech. Used to mask the noise caused by padding

  • fast_sampling (bool) – Whether to do fast sampling

  • fast_sampling_noise_schedule (list) – The noise schedule used for fast sampling

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.tensor
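fast_sampling_noise_schedule is a short list of noise levels that replaces the full training-time diffusion schedule at inference. The six-step schedule below is the one commonly cited from the DiffWave paper; whether it suits a given checkpoint is an assumption to verify against the model card:

```python
# Hedged sketch of a fast-sampling noise schedule for DiffWave: a short,
# strictly increasing list of noise levels in (0, 1). The specific values
# come from the DiffWave paper's fast-sampling setup and may not match
# every pretrained checkpoint.

fast_schedule = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]

# Basic structural properties any such schedule should satisfy.
assert all(a < b for a, b in zip(fast_schedule, fast_schedule[1:]))
assert all(0.0 < b < 1.0 for b in fast_schedule)
```

A hypothetical call would then be diffwave.decode_batch(mel, hop_len=256, fast_sampling=True, fast_sampling_noise_schedule=fast_schedule), where hop_len=256 is likewise an assumed value taken from the recipe's .yaml.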

mask_noise(waveform, mel_lens, hop_len)[source]

Mask the noise caused by padding during batch inference

Parameters:
  • waveform (torch.Tensor) – Batch of generated waveforms [batch, 1, time]

  • mel_lens (torch.Tensor) – A list of lengths of the mel-spectrograms in the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

Returns:

waveform – Batch of waveforms without padded noise [batch, 1, time]

Return type:

torch.tensor

decode_spectrogram(spectrogram, hop_len, fast_sampling=False, fast_sampling_noise_schedule=None)[source]

Computes waveforms from a single mel-spectrogram

Parameters:
  • spectrogram (torch.Tensor) – mel-spectrogram [mels, time]

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

  • fast_sampling (bool) – Whether to do fast sampling

  • fast_sampling_noise_schedule (list) – The noise schedule used for fast sampling

Returns:

waveform (torch.Tensor) – waveform [1, time]

The audio can be saved by:

>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)

forward(spectrogram)[source]

Decodes the input spectrograms

class speechbrain.inference.vocoders.UnitHIFIGAN(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).

Parameters:
  • *args (tuple) – See Pretrained

  • **kwargs (dict) – See Pretrained

Example

>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = UnitHIFIGAN.from_hparams(source="speechbrain/hifigan-hubert-l1-3-7-12-18-23-k1000-LibriTTS", savedir=tmpdir_vocoder)
>>> codes = torch.randint(0, 99, (100, 1))
>>> waveform = hifi_gan.decode_unit(codes)
HPARAMS_NEEDED = ['generator']
decode_batch(units, spk=None)[source]

Computes waveforms from a batch of discrete units

Parameters:
  • units (torch.tensor) – Batch of discrete units [batch, codes]

  • spk (torch.tensor) – Batch of speaker embeddings [batch, spk_dim]

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.tensor
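Since decode_batch expects units shaped [batch, codes], variable-length unit sequences must be padded to a common length before batching. A minimal sketch (the helper name and pad value are illustrative; check the model card for the unit vocabulary and any dedicated padding index):

```python
# Hedged sketch: right-pad discrete unit sequences to the longest one in
# the batch so they can be stacked into a [batch, codes] tensor. The
# pad_value=0 choice is an assumption, not a documented padding index.

def pad_unit_batch(sequences, pad_value=0):
    """Right-pad unit sequences to a common length."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

# Two unit sequences of different lengths become a rectangular batch.
batch = pad_unit_batch([[5, 7, 7, 12], [3, 9]])
```

The padded result can then be turned into a tensor (e.g. torch.tensor(batch)) and passed to decode_batch, optionally with speaker embeddings via spk.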

decode_unit(units, spk=None)[source]

Computes waveforms from a single sequence of discrete units

Parameters:
  • units (torch.Tensor) – codes [time]

  • spk (torch.Tensor) – speaker embedding [spk_dim]

Returns:

waveform – waveform [1, time]

Return type:

torch.tensor

forward(units, spk=None)[source]

Decodes the input units