speechbrain.inference.vocoders module

Specifies the inference interfaces for vocoders, which generate waveforms from spectrograms or discrete units.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

DiffWaveVocoder

A ready-to-use inference wrapper for DiffWave as a vocoder, supporting locally-conditioned generation (mel_spec -> waveform).

HIFIGAN

A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).

UnitHIFIGAN

A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).

Reference

class speechbrain.inference.vocoders.HIFIGAN(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> mel_specs = torch.rand(2, 80,298)
>>> waveforms = hifi_gan.decode_batch(mel_specs)
>>> # You can use the vocoder coupled with a TTS system
>>> # Initialize TTS (tacotron2)
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> from speechbrain.inference.TTS import Tacotron2
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['generator']
decode_batch(spectrogram, mel_lens=None, hop_len=None)[source]

Computes waveforms from a batch of mel-spectrograms.

Parameters:

  • spectrogram (torch.Tensor) – Batch of mel-spectrograms [batch, mels, time]

  • mel_lens (torch.Tensor) – A list of lengths of mel-spectrograms for the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; should be the same value as in the .yaml file

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
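As a sketch of the intended batched workflow (the frame counts and mel dimension below are illustrative assumptions, not values from any checkpoint): variable-length spectrograms are padded to a common frame count, and the true lengths are passed as mel_lens so decode_batch can mask the noise produced over the padded frames.

```python
# Illustrative sketch: batching variable-length mel-spectrograms for
# decode_batch. Frame counts and the mel dimension (80) are assumptions.
lengths = [298, 150, 220]   # true frame counts per utterance
max_len = max(lengths)      # pad every spectrogram to the longest one
batch_shape = (len(lengths), 80, max_len)  # [batch, mels, time]
print(batch_shape)  # (3, 80, 298)
```

The lengths list is what would be handed to decode_batch as mel_lens, alongside the hop_len from the model's .yaml file.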

mask_noise(waveform, mel_lens, hop_len)[source]

Masks the noise caused by padding during batch inference.

Parameters:

  • waveform (torch.Tensor) – Batch of generated waveforms [batch, 1, time]

  • mel_lens (torch.Tensor) – A list of lengths of mel-spectrograms for the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

Returns:

waveform – Batch of waveforms without padded noise [batch, 1, time]

Return type:

torch.Tensor
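The frame-to-sample mapping this masking relies on can be sketched as follows (a minimal illustration, assuming each mel frame corresponds to hop_len waveform samples):

```python
# Minimal sketch of the frame-to-sample mapping assumed by mask_noise:
# a spectrogram of L frames maps to roughly L * hop_len waveform samples,
# so anything past that point in a padded batch is masked out.
def valid_sample_counts(mel_lens, hop_len):
    return [int(length) * hop_len for length in mel_lens]

print(valid_sample_counts([298, 150], 256))  # [76288, 38400]
```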

decode_spectrogram(spectrogram)[source]

Computes waveforms from a single mel-spectrogram.

Parameters:

spectrogram (torch.Tensor) – mel-spectrogram [mels, time]

Returns:

waveform (torch.Tensor) – waveform [1, time]

The audio can be saved by:

>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)

forward(spectrogram)[source]

Decodes the input spectrograms

training: bool
class speechbrain.inference.vocoders.DiffWaveVocoder(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use inference wrapper for DiffWave as a vocoder. The wrapper supports locally-conditioned generation: mel_spec -> waveform.

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

HPARAMS_NEEDED = ['diffusion']
decode_batch(mel, hop_len, mel_lens=None, fast_sampling=False, fast_sampling_noise_schedule=None)[source]

Generates waveforms from spectrograms.

Parameters:

  • mel (torch.Tensor) – spectrogram [batch, mels, time]

  • hop_len (int) – Hop length during mel-spectrogram extraction; should be the same value as in the .yaml file. Used to determine the output waveform length and to mask the noise for the vocoding task.

  • mel_lens (torch.Tensor) – Used to mask the noise caused by padding; a list of lengths of mel-spectrograms for the batch. Can be obtained from the output of Tacotron/FastSpeech.

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
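Fast sampling replaces the full diffusion schedule with a short, hand-picked list of noise levels. The six-step schedule below is a common choice in DiffWave implementations, but treat it as an illustrative assumption rather than the value shipped with any particular checkpoint:

```python
# Illustrative fast-sampling noise schedule (an assumption, not a
# checkpoint's own value): a short list of increasing noise levels that
# stands in for the full training-time diffusion schedule.
fast_schedule = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]
assert all(a < b for a, b in zip(fast_schedule, fast_schedule[1:]))
print(len(fast_schedule))  # 6

# It would then be passed to decode_batch roughly as:
# waveforms = diffwave.decode_batch(
#     mel, hop_len=256, fast_sampling=True,
#     fast_sampling_noise_schedule=fast_schedule,
# )
```

Fewer steps trade some audio quality for a large speedup over the full schedule.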

mask_noise(waveform, mel_lens, hop_len)[source]

Masks the noise caused by padding during batch inference.

Parameters:

  • waveform (torch.Tensor) – Batch of generated waveforms [batch, 1, time]

  • mel_lens (torch.Tensor) – A list of lengths of mel-spectrograms for the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

Returns:

waveform – Batch of waveforms without padded noise [batch, 1, time]

Return type:

torch.Tensor

decode_spectrogram(spectrogram, hop_len, fast_sampling=False, fast_sampling_noise_schedule=None)[source]

Computes waveforms from a single mel-spectrogram.

Parameters:

  • spectrogram (torch.Tensor) – mel-spectrogram [mels, time]

  • hop_len (int) – hop length used for mel-spectrogram extraction; same value as in the .yaml file

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling

Returns:

waveform (torch.Tensor) – waveform [1, time]

The audio can be saved by:

>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)

forward(spectrogram)[source]

Decodes the input spectrograms

training: bool
class speechbrain.inference.vocoders.UnitHIFIGAN(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = UnitHIFIGAN.from_hparams(source="speechbrain/tts-hifigan-unit-hubert-l6-k100-ljspeech", savedir=tmpdir_vocoder)
>>> codes = torch.randint(0, 99, (100,))
>>> waveform = hifi_gan.decode_unit(codes)
HPARAMS_NEEDED = ['generator']
decode_batch(units)[source]

Computes waveforms from a batch of discrete units.

Parameters:

units (torch.Tensor) – Batch of discrete units [batch, codes]

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
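Building the [batch, codes] input from variable-length unit sequences can be sketched as below; the pad value and the unit sequences are illustrative assumptions, so check the model card for the actual padding convention of a given checkpoint:

```python
# Illustrative sketch: padding variable-length discrete-unit sequences
# into a [batch, codes] matrix. PAD value and sequences are assumptions.
PAD = 0
sequences = [[12, 57, 3, 99], [41, 8]]
max_len = max(len(s) for s in sequences)
batch = [s + [PAD] * (max_len - len(s)) for s in sequences]
print(batch)  # [[12, 57, 3, 99], [41, 8, 0, 0]]
```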

decode_unit(units)[source]

Computes waveforms from a single sequence of discrete units.

Parameters:

units (torch.Tensor) – codes [time]

Returns:

waveform – waveform [1, time]

Return type:

torch.Tensor

forward(units)[source]

Decodes the input units

training: bool