speechbrain.inference.vocoders module

Specifies the inference interfaces for vocoders, which generate waveforms from spectrograms or discrete units.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

DiffWaveVocoder

A ready-to-use inference wrapper for DiffWave as a vocoder, supporting locally-conditioned generation (mel_spec -> waveform).

HIFIGAN

A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).

UnitHIFIGAN

A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).

Reference

class speechbrain.inference.vocoders.HIFIGAN(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> mel_specs = torch.rand(2, 80,298)
>>> waveforms = hifi_gan.decode_batch(mel_specs)
>>> # You can use the vocoder coupled with a TTS system
>>> # Initialize TTS (tacotron2)
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> from speechbrain.inference.TTS import Tacotron2
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['generator']
decode_batch(spectrogram, mel_lens=None, hop_len=None)[source]

Computes waveforms from a batch of mel-spectrograms.

Parameters:

  • spectrogram (torch.Tensor) – Batch of mel-spectrograms [batch, mels, time]

  • mel_lens (torch.Tensor) – A list of lengths of mel-spectrograms for the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; should be the same value as in the .yaml file

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
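As a sketch of the intended batched workflow (the frame counts and mel dimension below are illustrative assumptions, not values from any checkpoint): variable-length spectrograms are padded to a common frame count, and the true lengths are passed as mel_lens so decode_batch can mask the noise produced over the padded frames.

```python
# Illustrative sketch: batching variable-length mel-spectrograms for
# decode_batch. Frame counts and the mel dimension (80) are assumptions.
lengths = [298, 150, 220]   # true frame counts per utterance
max_len = max(lengths)      # pad every spectrogram to the longest one
batch_shape = (len(lengths), 80, max_len)  # [batch, mels, time]
print(batch_shape)  # (3, 80, 298)
```

The lengths list is what would be handed to decode_batch as mel_lens, alongside the hop_len from the model's .yaml file.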

mask_noise(waveform, mel_lens, hop_len)[source]

Masks the noise caused by padding during batch inference.

Parameters:

  • waveform (torch.Tensor) – Batch of generated waveforms [batch, 1, time]

  • mel_lens (torch.Tensor) – A list of lengths of mel-spectrograms for the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

Returns:

waveform – Batch of waveforms without padded noise [batch, 1, time]

Return type:

torch.Tensor
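The frame-to-sample mapping this masking relies on can be sketched as follows (a minimal illustration, assuming each mel frame corresponds to hop_len waveform samples):

```python
# Minimal sketch of the frame-to-sample mapping assumed by mask_noise:
# a spectrogram of L frames maps to roughly L * hop_len waveform samples,
# so anything past that point in a padded batch is masked out.
def valid_sample_counts(mel_lens, hop_len):
    return [int(length) * hop_len for length in mel_lens]

print(valid_sample_counts([298, 150], 256))  # [76288, 38400]
```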

decode_spectrogram(spectrogram)[source]

Computes waveforms from a single mel-spectrogram.

Parameters:

spectrogram (torch.Tensor) – mel-spectrogram [mels, time]

Returns:

waveform (torch.Tensor) – waveform [1, time]

The audio can be saved by:

>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)

forward(spectrogram)[source]

Decodes the input spectrograms

training: bool
class speechbrain.inference.vocoders.DiffWaveVocoder(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use inference wrapper for DiffWave as a vocoder. The wrapper supports locally-conditioned generation: mel_spec -> waveform.

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

HPARAMS_NEEDED = ['diffusion']
decode_batch(mel, hop_len, mel_lens=None, fast_sampling=False, fast_sampling_noise_schedule=None)[source]

Generates waveforms from spectrograms.

Parameters:

  • mel (torch.Tensor) – spectrogram [batch, mels, time]

  • hop_len (int) – Hop length during mel-spectrogram extraction; should be the same value as in the .yaml file. Used to determine the output waveform length and to mask the noise for the vocoding task.

  • mel_lens (torch.Tensor) – Used to mask the noise caused by padding; a list of lengths of mel-spectrograms for the batch. Can be obtained from the output of Tacotron/FastSpeech.

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
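Fast sampling replaces the full diffusion schedule with a short, hand-picked list of noise levels. The six-step schedule below is a common choice in DiffWave implementations, but treat it as an illustrative assumption rather than the value shipped with any particular checkpoint:

```python
# Illustrative fast-sampling noise schedule (an assumption, not a
# checkpoint's own value): a short list of increasing noise levels that
# stands in for the full training-time diffusion schedule.
fast_schedule = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]
assert all(a < b for a, b in zip(fast_schedule, fast_schedule[1:]))
print(len(fast_schedule))  # 6

# It would then be passed to decode_batch roughly as:
# waveforms = diffwave.decode_batch(
#     mel, hop_len=256, fast_sampling=True,
#     fast_sampling_noise_schedule=fast_schedule,
# )
```

Fewer steps trade some audio quality for a large speedup over the full schedule.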

mask_noise(waveform, mel_lens, hop_len)[source]

Masks the noise caused by padding during batch inference.

Parameters:

  • waveform (torch.Tensor) – Batch of generated waveforms [batch, 1, time]

  • mel_lens (torch.Tensor) – A list of lengths of mel-spectrograms for the batch; can be obtained from the output of Tacotron/FastSpeech

  • hop_len (int) – Hop length used for mel-spectrogram extraction; same value as in the .yaml file

Returns:

waveform – Batch of waveforms without padded noise [batch, 1, time]

Return type:

torch.Tensor

decode_spectrogram(spectrogram, hop_len, fast_sampling=False, fast_sampling_noise_schedule=None)[source]

Computes waveforms from a single mel-spectrogram.

Parameters:

  • spectrogram (torch.Tensor) – mel-spectrogram [mels, time]

  • hop_len (int) – hop length used for mel-spectrogram extraction; same value as in the .yaml file

  • fast_sampling (bool) – whether to do fast sampling

  • fast_sampling_noise_schedule (list) – the noise schedule used for fast sampling

Returns:

waveform (torch.Tensor) – waveform [1, time]

The audio can be saved by:

>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)

forward(spectrogram)[source]

Decodes the input spectrograms

training: bool
class speechbrain.inference.vocoders.UnitHIFIGAN(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = UnitHIFIGAN.from_hparams(source="speechbrain/tts-hifigan-unit-hubert-l6-k100-ljspeech", savedir=tmpdir_vocoder)
>>> codes = torch.randint(0, 99, (100,))
>>> waveform = hifi_gan.decode_unit(codes)
HPARAMS_NEEDED = ['generator']
decode_batch(units)[source]

Computes waveforms from a batch of discrete units.

Parameters:

units (torch.Tensor) – Batch of discrete units [batch, codes]

Returns:

waveforms – Batch of waveforms [batch, 1, time]

Return type:

torch.Tensor
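Building the [batch, codes] input from variable-length unit sequences can be sketched as below; the pad value and the unit sequences are illustrative assumptions, so check the model card for the actual padding convention of a given checkpoint:

```python
# Illustrative sketch: padding variable-length discrete-unit sequences
# into a [batch, codes] matrix. PAD value and sequences are assumptions.
PAD = 0
sequences = [[12, 57, 3, 99], [41, 8]]
max_len = max(len(s) for s in sequences)
batch = [s + [PAD] * (max_len - len(s)) for s in sequences]
print(batch)  # [[12, 57, 3, 99], [41, 8, 0, 0]]
```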

decode_unit(units)[source]

Computes waveforms from a single sequence of discrete units.

Parameters:

units (torch.Tensor) – codes [time]

Returns:

waveform – waveform [1, time]

Return type:

torch.Tensor

forward(units)[source]

Decodes the input units

training: bool