speechbrain.inference.vocoders module
Specifies the inference interfaces for Text-To-Speech (TTS) modules.
- Authors:
Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023
Adel Moumen 2023
Pradnya Kandarkar 2023
Summary
Classes:
- DiffWaveVocoder: A ready-to-use inference wrapper for DiffWave as vocoder. The wrapper allows to perform generative tasks: locally-conditional generation: mel_spec -> waveform.
- HIFIGAN: A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).
- UnitHIFIGAN: A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).
Reference
- class speechbrain.inference.vocoders.HIFIGAN(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use wrapper for HiFiGAN (mel_spec -> waveform).
Example
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> mel_specs = torch.rand(2, 80, 298)
>>> waveforms = hifi_gan.decode_batch(mel_specs)
>>> # You can use the vocoder coupled with a TTS system
>>> # Initialize TTS (tacotron2)
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> from speechbrain.inference.TTS import Tacotron2
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
- HPARAMS_NEEDED = ['generator']
- decode_batch(spectrogram, mel_lens=None, hop_len=None)[source]
Computes waveforms from a batch of mel-spectrograms
- Parameters:
spectrogram (torch.Tensor) – Batch of mel-spectrograms [batch, mels, time]
mel_lens (torch.tensor) – A list of lengths of mel-spectrograms for the batch. Can be obtained from the output of Tacotron/FastSpeech.
hop_len (int) – Hop length used for mel-spectrogram extraction. Should be the same value as in the .yaml file.
- Returns:
waveforms – Batch of generated waveforms [batch, 1, time]
- Return type:
torch.Tensor
- mask_noise(waveform, mel_lens, hop_len)[source]
Mask the noise caused by padding during batch inference
- Parameters:
waveform (torch.tensor) – Batch of generated waveforms [batch, 1, time]
mel_lens (torch.tensor) – A list of lengths of mel-spectrograms for the batch. Can be obtained from the output of Tacotron/FastSpeech.
hop_len (int) – Hop length used for mel-spectrogram extraction. Same value as in the .yaml file.
- Returns:
waveform – Batch of waveforms without padded noise [batch, 1, time]
- Return type:
torch.tensor
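The idea behind the padding mask can be sketched in plain Python (a simplified stand-in for the torch implementation, not the actual SpeechBrain code): each waveform is only valid up to mel_len * hop_len samples, since each mel frame corresponds to hop_len output samples; everything beyond that comes from padded mel frames and is zeroed.

```python
def mask_noise(waveforms, mel_lens, hop_len):
    """Zero out samples generated from padded mel frames.

    waveforms: list of per-item sample lists (stand-in for [batch, 1, time])
    mel_lens:  number of real (unpadded) mel frames per item
    hop_len:   samples per mel frame
    """
    masked = []
    for wav, mel_len in zip(waveforms, mel_lens):
        valid = mel_len * hop_len  # number of real samples
        masked.append(wav[:valid] + [0.0] * (len(wav) - valid))
    return masked
```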
- decode_spectrogram(spectrogram)[source]
Computes waveforms from a single mel-spectrogram
- Parameters:
spectrogram (torch.Tensor) – mel-spectrogram [mels, time]
- Returns:
waveform (torch.Tensor) – waveform [1, time]
Audio can be saved by:
>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)
- class speechbrain.inference.vocoders.DiffWaveVocoder(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use inference wrapper for DiffWave as vocoder. The wrapper allows to perform generative tasks:
locally-conditional generation: mel_spec -> waveform
- HPARAMS_NEEDED = ['diffusion']
- decode_batch(mel, hop_len, mel_lens=None, fast_sampling=False, fast_sampling_noise_schedule=None)[source]
Generate waveforms from spectrograms
- Parameters:
mel (torch.tensor) – Spectrogram [batch, mels, time]
hop_len (int) – Hop length during mel-spectrogram extraction. Should be the same value as in the .yaml file. Used to determine the output waveform length, and to mask the noise for the vocoding task.
mel_lens (torch.tensor) – A list of lengths of mel-spectrograms for the batch, used to mask the noise caused by padding. Can be obtained from the output of Tacotron/FastSpeech.
fast_sampling (bool) – Whether to do fast sampling
fast_sampling_noise_schedule (list) – The noise schedule used for fast sampling
- Returns:
waveforms – Batch of generated waveforms [batch, 1, time]
- Return type:
torch.tensor
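To illustrate what a fast-sampling noise schedule is (a sketch, not SpeechBrain code): instead of running the full training-time diffusion schedule in reverse, fast sampling uses only a handful of hand-picked noise levels. The six values below follow the fast schedule suggested in the DiffWave paper; treat them as an assumption and check the config of your specific checkpoint.

```python
# Assumed 6-step fast schedule from the DiffWave paper.
FAST_SCHEDULE = [0.0001, 0.001, 0.01, 0.05, 0.2, 0.5]

def alpha_bar(betas):
    """Cumulative product of (1 - beta), DDPM-style.

    The length of the schedule equals the number of reverse-diffusion
    steps run at inference, so a short schedule means fast vocoding.
    """
    out, acc = [], 1.0
    for beta in betas:
        acc *= 1.0 - beta
        out.append(acc)
    return out
```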
- mask_noise(waveform, mel_lens, hop_len)[source]
Mask the noise caused by padding during batch inference
- Parameters:
waveform (torch.tensor) – Batch of generated waveforms [batch, 1, time]
mel_lens (torch.tensor) – A list of lengths of mel-spectrograms for the batch. Can be obtained from the output of Tacotron/FastSpeech.
hop_len (int) – Hop length used for mel-spectrogram extraction. Same value as in the .yaml file.
- Returns:
waveform – Batch of waveforms without padded noise [batch, 1, time]
- Return type:
torch.tensor
- decode_spectrogram(spectrogram, hop_len, fast_sampling=False, fast_sampling_noise_schedule=None)[source]
Computes waveforms from a single mel-spectrogram
- Parameters:
spectrogram (torch.tensor) – mel-spectrogram [mels, time]
hop_len (int) – Hop length during mel-spectrogram extraction. Should be the same value as in the .yaml file.
fast_sampling (bool) – Whether to do fast sampling
fast_sampling_noise_schedule (list) – The noise schedule used for fast sampling
- Returns:
waveform (torch.tensor) – waveform [1, time]
Audio can be saved by:
>>> import torchaudio
>>> waveform = torch.rand(1, 666666)
>>> sample_rate = 22050
>>> torchaudio.save(str(getfixture('tmpdir') / "test.wav"), waveform, sample_rate)
- class speechbrain.inference.vocoders.UnitHIFIGAN(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use wrapper for Unit HiFiGAN (discrete units -> waveform).
Example
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> hifi_gan = UnitHIFIGAN.from_hparams(source="speechbrain/hifigan-hubert-l1-3-7-12-18-23-k1000-LibriTTS", savedir=tmpdir_vocoder)
>>> codes = torch.randint(0, 99, (100, 1))
>>> waveform = hifi_gan.decode_unit(codes)
- HPARAMS_NEEDED = ['generator']
- decode_batch(units, spk=None)[source]
Computes waveforms from a batch of discrete units
- Parameters:
units (torch.tensor) – Batch of discrete units [batch, codes]
spk (torch.tensor) – Batch of speaker embeddings [batch, spk_dim]
- Returns:
waveforms – Batch of generated waveforms [batch, 1, time]
- Return type:
torch.tensor
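Since decode_batch expects a rectangular [batch, codes] layout, variable-length unit sequences must be padded to a common length before batching. A minimal pure-Python sketch of that step (the pad value 0 is an assumption; the appropriate padding index depends on the checkpoint's unit inventory):

```python
def pad_unit_sequences(sequences, pad_value=0):
    """Right-pad unit-code sequences to a common length so they can be
    stacked into the [batch, codes] layout expected by decode_batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
```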