speechbrain.inference.TTS module
Specifies the inference interfaces for Text-To-Speech (TTS) modules.
- Authors:
Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023
Adel Moumen 2023
Pradnya Kandarkar 2023
Summary

Classes:

FastSpeech2: A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).
FastSpeech2InternalAlignment: A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).
MSTacotron2: A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.
Tacotron2: A ready-to-use wrapper for Tacotron2 (text -> mel_spec).
Reference
- class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use wrapper for Tacotron2 (text -> mel_spec).
Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...     "A quick brown fox jumped over the lazy dog",
...     "How much wood would a woodchuck chuck?",
...     "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
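The decoded waveforms can then be written to disk for listening. Below is a minimal sketch using torchaudio; the 22050 Hz sample rate of the LJSpeech vocoder is an assumption not stated in the example above:

import torchaudio

# waveforms from decode_batch has shape [batch, 1, time]; squeezing the
# channel dim gives [batch, time], which torchaudio.save accepts as
# [channels, time] for a single-item batch.
# 22050 Hz is the assumed rate of the tts-hifigan-ljspeech vocoder.
torchaudio.save("tts_output.wav", waveforms.squeeze(1).cpu(), 22050)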
- HPARAMS_NEEDED = ['model', 'text_to_sequence']
- text_to_seq(txt)[source]
Encodes raw text into a tensor with a custom text-to-sequence function
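As a usage illustration, the encoding step can be inspected before batching. A minimal sketch, assuming text_to_seq returns the token sequence together with its length (an assumption based on the docstring above):

# Hypothetical inspection of the text-to-sequence step; the exact
# return shape is an assumption, not confirmed by the docs above.
sequence, length = tacotron2.text_to_seq("Mary had a little lamb")
print(length)  # number of encoded tokens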
- class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).
Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts)
>>> # Sample rate of the reference audio must be greater than or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path)
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text)
- HPARAMS_NEEDED = ['model']
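The random-speaker path plugs into the same vocoder pipeline as voice cloning. A minimal sketch, reusing the mstacotron2 and hifi_gan instances from the example above:

# Synthesize with a randomly sampled speaker identity, then vocode
# the mel-spectrogram to a waveform (hifi_gan as initialized above).
mel_output, mel_length, alignment = mstacotron2.generate_random_voice(
    "Mary had a little lamb."
)
waveforms = hifi_gan.decode_batch(mel_output)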
- class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).
Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...     "A quick brown fox jumped over the lazy dog",
...     "How much wood would a woodchuck chuck?",
...     "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
- HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
- encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrograms for a list of texts
- Parameters:
    - texts (List[str]) – the texts to be converted to spectrograms
    - pace (float) – pace for the speech synthesis
    - pitch_rate (float) – scaling factor for phoneme pitches
    - energy_rate (float) – scaling factor for phoneme energies
- Returns:
    tensors of output spectrograms, durations, pitch, and energy
- encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrograms for a list of phoneme sequences
- Parameters:
    - phonemes (List[List[str]]) – the phoneme sequences to be converted to spectrograms
    - pace (float) – pace for the speech synthesis
    - pitch_rate (float) – scaling factor for phoneme pitches
    - energy_rate (float) – scaling factor for phoneme energies
- Returns:
    tensors of output spectrograms, durations, pitch, and energy
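Phoneme input skips the internal grapheme-to-phoneme step. A minimal sketch, where the ARPAbet-style phoneme spellings and the nested-list input format are illustrative assumptions rather than the model's documented inventory:

# Encode a pre-computed phoneme sequence directly. The phoneme
# symbols below are illustrative; use the inventory the model
# was actually trained on.
phonemes = [["HH", "AH", "L", "OW", "W", "ER", "L", "D"]]
mel_outputs, durations, pitch, energy = fastspeech2.encode_phoneme(phonemes)
waveforms = hifi_gan.decode_batch(mel_outputs)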
- encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Batch inference for a tensor of phoneme sequences
- Parameters:
    - tokens_padded (torch.Tensor) – a batch of padded phoneme sequences
    - pace (float) – pace for the speech synthesis
    - pitch_rate (float) – scaling factor for phoneme pitches
    - energy_rate (float) – scaling factor for phoneme energies
- Returns:
    - post_mel_outputs (torch.Tensor)
    - durations (torch.Tensor)
    - pitch (torch.Tensor)
    - energy (torch.Tensor)
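The pace, pitch_rate, and energy_rate arguments give simple prosody control over the synthesized speech. A minimal sketch, where the specific scaling values are chosen purely for illustration:

# Scale the predicted durations, pitch, and energy relative to the
# defaults (all 1.0 = unchanged); the values here are illustrative.
mel_outputs, durations, pitch, energy = fastspeech2.encode_text(
    ["Mary had a little lamb."],
    pace=1.2,         # scale the predicted phoneme durations
    pitch_rate=1.1,   # scale the predicted pitch contour
    energy_rate=0.9,  # scale the predicted energy
)
waveforms = hifi_gan.decode_batch(mel_outputs)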
- class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]
Bases: Pretrained
A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).
Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts)
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> items = [
...     "A quick brown fox jumped over the lazy dog",
...     "How much wood would a woodchuck chuck?",
...     "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the Vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."])
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs)
- HPARAMS_NEEDED = ['model', 'input_encoder']
- encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrograms for a list of texts
- Parameters:
    - texts (List[str]) – the texts to be converted to spectrograms
    - pace (float) – pace for the speech synthesis
    - pitch_rate (float) – scaling factor for phoneme pitches
    - energy_rate (float) – scaling factor for phoneme energies
- Returns:
    tensors of output spectrograms, durations, pitch, and energy
- encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Computes mel-spectrograms for a list of phoneme sequences
- Parameters:
    - phonemes (List[List[str]]) – the phoneme sequences to be converted to spectrograms
    - pace (float) – pace for the speech synthesis
    - pitch_rate (float) – scaling factor for phoneme pitches
    - energy_rate (float) – scaling factor for phoneme energies
- Returns:
    tensors of output spectrograms, durations, pitch, and energy
- encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]
Batch inference for a tensor of phoneme sequences
- Parameters:
    - tokens_padded (torch.Tensor) – a batch of padded phoneme sequences
    - pace (float) – pace for the speech synthesis
    - pitch_rate (float) – scaling factor for phoneme pitches
    - energy_rate (float) – scaling factor for phoneme energies
- Returns:
    - post_mel_outputs (torch.Tensor)
    - durations (torch.Tensor)
    - pitch (torch.Tensor)
    - energy (torch.Tensor)
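Unlike encode_text and encode_phoneme, encode_batch expects sequences that are already integer-encoded and padded into a single tensor. A minimal sketch, where the token ids are placeholders and the assumption is that the model's input_encoder produced them:

import torch

# tokens_padded: [batch, max_len] integer token ids, zero-padded.
# The ids below are placeholders, not a real encoding; in practice
# they would come from the model's input_encoder.
tokens_padded = torch.tensor([
    [12, 43, 7, 19, 0, 0],
    [5, 33, 21, 8, 14, 2],
])
post_mel, durations, pitch, energy = fastspeech2.encode_batch(tokens_padded)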