speechbrain.inference.TTS module
Specifies the inference interfaces for Text-To-Speech (TTS) modules.
- Authors:
Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023
Adel Moumen 2023
Pradnya Kandarkar 2023
Summary
Classes:
FastSpeech2 – A ready-to-use wrapper for Fastspeech2 (text -> mel_spec).
FastSpeech2InternalAlignment – A ready-to-use wrapper for Fastspeech2 with internal alignment (text -> mel_spec).
MSTacotron2 – A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.
Tacotron2 – A ready-to-use wrapper for Tacotron2 (text -> mel_spec).
Reference
- class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use wrapper for Tacotron2 (text -> mel_spec).
- Parameters:
hparams – Hyperparameters (from HyperPyYAML)
Example
>>> from speechbrain.inference.TTS import Tacotron2
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...     "A quick brown fox jumped over the lazy dog",
...     "How much wood would a woodchuck chuck?",
...     "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
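The two-stage pipeline in the example (text to mel-spectrogram, then mel-spectrogram to waveform) can be wrapped in a small helper. The sketch below is illustrative only: `text_to_waveform`, `StubTTS`, and `StubVocoder` are hypothetical names, and the stubs merely mimic the call signatures shown in the example so the wiring can be dry-run without downloading any checkpoint.

```python
def text_to_waveform(text, tts, vocoder):
    """Chain a text-to-mel model and a vocoder, as in the example above."""
    # encode_text returns (mel_output, mel_length, alignment) in the example.
    mel_output, mel_length, alignment = tts.encode_text(text)
    # decode_batch turns the mel-spectrogram batch into waveforms.
    return vocoder.decode_batch(mel_output)


# Hypothetical stand-ins, used only to dry-run the wiring without downloads.
class StubTTS:
    def encode_text(self, text):
        mel = [[0.0] * len(text)]  # fake "spectrogram": one frame per character
        return mel, [len(text)], None


class StubVocoder:
    def decode_batch(self, mel):
        # Fake "waveform": 256 samples per frame (an arbitrary choice).
        return [[0.0] * (len(frames) * 256) for frames in mel]


waveforms = text_to_waveform("Mary had a little lamb", StubTTS(), StubVocoder())
print(len(waveforms), len(waveforms[0]))
```

With the real objects from the example, the same helper would presumably be called as `text_to_waveform("Mary had a little lamb", tacotron2, hifi_gan)`.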
- HPARAMS_NEEDED = ['model', 'text_to_sequence']
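`HPARAMS_NEEDED` appears to list the keys that the hyperparameter file must define for the wrapper to work. A minimal sketch of how such a requirement could be enforced follows; `PretrainedSketch` and `Tacotron2Sketch` are hypothetical classes written for illustration and are not SpeechBrain's own implementation.

```python
class PretrainedSketch:
    """Hypothetical stand-in for a Pretrained-style base class."""

    HPARAMS_NEEDED = []

    def __init__(self, hparams):
        # Refuse to construct the wrapper if any required key is absent.
        missing = [key for key in self.HPARAMS_NEEDED if key not in hparams]
        if missing:
            raise ValueError(f"Missing required hparams: {missing}")
        self.hparams = hparams


class Tacotron2Sketch(PretrainedSketch):
    # Same keys as the HPARAMS_NEEDED attribute listed above.
    HPARAMS_NEEDED = ["model", "text_to_sequence"]


# All required keys present: construction succeeds.
ok = Tacotron2Sketch({"model": object(), "text_to_sequence": len})

# One required key missing: construction fails with a descriptive error.
try:
    Tacotron2Sketch({"model": object()})
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```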
- class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).
Example
>>> from speechbrain.inference.TTS import MSTacotron2
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts)
>>> # Sample rate of the reference audio must be greater than or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text)
- HPARAMS_NEEDED = ['model']
- clone_voice(texts, audio_path)[source]
Generates mel-spectrogram using input text and reference audio
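The class example notes that the reference audio's sample rate must be at least that of the speaker embedding model. A pre-flight check along those lines could look like the sketch below, which uses only the standard-library `wave` module. The 16 kHz threshold and the helper names are assumptions made for illustration; the actual rate required by a given pretrained model should be taken from its hyperparameters.

```python
import os
import tempfile
import wave

# Assumption for illustration: the speaker embedding model expects 16 kHz audio.
EMBEDDING_MODEL_SAMPLE_RATE = 16000


def reference_audio_rate_ok(path, required_rate=EMBEDDING_MODEL_SAMPLE_RATE):
    """Return True if the WAV file's sample rate is >= required_rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() >= required_rate


def write_silent_wav(path, sample_rate, seconds=1):
    """Write a mono 16-bit silent WAV file, only to exercise the check."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * (sample_rate * seconds))


with tempfile.TemporaryDirectory() as tmpdir:
    high = os.path.join(tmpdir, "ref_22050.wav")
    low = os.path.join(tmpdir, "ref_8000.wav")
    write_silent_wav(high, 22050)
    write_silent_wav(low, 8000)
    high_ok = reference_audio_rate_ok(high)  # 22050 Hz passes the check
    low_ok = reference_audio_rate_ok(low)    # 8000 Hz fails the check
```

Running such a check before `clone_voice` gives an early, readable failure instead of an error from deep inside the model.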
- class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use wrapper for Fastspeech2 (text -> mel_spec).
- Parameters:
hparams – Hyperparameters (from HyperPyYAML)
Example
>>> from speechbrain.inference.TTS import FastSpeech2
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...     "A quick brown fox jumped over the lazy dog",
...     "How much wood would a woodchuck chuck?",
...     "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
- HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
- encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrogram for a list of texts
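- Parameters:
texts – list of texts to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
- Returns:
tensors of output spectrograms, durations, pitch and energy (the four values unpacked in the class example)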
- encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrogram for a list of phoneme sequences
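- Parameters:
phonemes – list of phoneme sequences to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
- Returns:
tensors of output spectrograms, durations, pitch and energy (the four values unpacked in the class example)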
- encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Batch inference for a tensor of phoneme sequences
- Parameters:
tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
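The name `tokens_padded` suggests that variable-length sequences of encoded phonemes are padded to a common length before being stacked into one tensor. The sketch below shows that padding step in plain Python. The helper `pad_token_sequences`, the token ids, and the pad value of 0 are all assumptions for illustration, not values taken from SpeechBrain.

```python
def pad_token_sequences(sequences, pad_value=0):
    """Right-pad lists of token ids to a common length.

    Returns (padded, lengths). pad_value=0 is an assumption for this
    sketch, not a documented constant.
    """
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


# Three hypothetical encoded phoneme sequences of different lengths.
encoded = [[5, 12, 7], [3, 9], [8, 1, 4, 6]]
tokens_padded, lengths = pad_token_sequences(encoded)
print(tokens_padded)
print(lengths)
```

The resulting rectangular list could then be converted with `torch.tensor(tokens_padded)` to obtain the kind of tensor the parameter description refers to.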
- forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes the mel-spectrogram for a single input text
- Parameters:
text (str) – A text to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
- class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]
Bases:
Pretrained
A ready-to-use wrapper for Fastspeech2 with internal alignment (text -> mel_spec).
- Parameters:
hparams – Hyperparameters (from HyperPyYAML)
Example
>>> from speechbrain.inference.TTS import FastSpeech2InternalAlignment
>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...     "A quick brown fox jumped over the lazy dog",
...     "How much wood would a woodchuck chuck?",
...     "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFIGAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
- HPARAMS_NEEDED = ['model', 'input_encoder']
- encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrogram for a list of texts
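- Parameters:
texts – list of texts to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
- Returns:
tensors of output spectrograms, durations, pitch and energy (the four values unpacked in the class example)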
- encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrogram for a list of phoneme sequences
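- Parameters:
phonemes – list of phoneme sequences to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
- Returns:
tensors of output spectrograms, durations, pitch and energy (the four values unpacked in the class example)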
- encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Batch inference for a tensor of phoneme sequences
- Parameters:
tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies
- forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes the mel-spectrogram for a single input text
- Parameters:
text (str) – A text to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies