speechbrain.inference.TTS module

Specifies the inference interfaces for Text-To-Speech (TTS) modules.

Authors:

Aku Rouhe 2021
Peter Plantinga 2021
Loren Lugosch 2020
Mirco Ravanelli 2020
Titouan Parcollet 2021
Abdel Heba 2021
Andreas Nautsch 2022, 2023
Pooneh Mousavi 2023
Sylvain de Langen 2023
Adel Moumen 2023
Pradnya Kandarkar 2023

Summary

Classes:

`FastSpeech2`	A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).
`FastSpeech2InternalAlignment`	A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).
`MSTacotron2`	A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.
`Tacotron2`	A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Reference

class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)

Bases: Pretrained

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Parameters: hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)

>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)

HPARAMS_NEEDED = ['model', 'text_to_sequence']

text_to_seq(txt): Encodes raw text into a tensor with a custom text-to-sequence function

encode_batch(texts)

Computes mel-spectrograms for a list of texts.

Texts must be sorted in decreasing order of length.

Parameters: texts (List[str]) – texts to be encoded into spectrograms
Return type: tensors of output spectrograms, output lengths and alignments
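
Because `encode_batch` expects its texts sorted by decreasing length, input in arbitrary order may need to be sorted first and the results mapped back afterwards. The following is a minimal sketch in plain Python. `sort_by_length_desc` and `restore_order` are hypothetical helper names, not part of SpeechBrain, and measuring length in characters is an assumption about what "length" means here.

```python
def sort_by_length_desc(texts):
    """Return (sorted_texts, order); order[i] is the original index of
    sorted_texts[i]. Python's sort is stable, so ties keep input order."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
    return [texts[i] for i in order], order


def restore_order(results, order):
    """Undo the permutation: put results[i] back at position order[i]."""
    restored = [None] * len(results)
    for sorted_pos, original_pos in enumerate(order):
        restored[original_pos] = results[sorted_pos]
    return restored


items = [
    "Never odd or even",
    "A quick brown fox jumped over the lazy dog",
    "How much wood would a woodchuck chuck?",
]
sorted_items, order = sort_by_length_desc(items)
# sorted_items is now longest-first and could be passed to
# tacotron2.encode_batch(sorted_items); per-item results can then be
# mapped back to the input order with restore_order(per_item_results, order).
```

Keeping the permutation alongside the sorted list avoids having to search for each text again when the outputs are matched back to the inputs.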

encode_text(text): Runs inference for a single text string.

forward(texts): Encodes the input texts.

training: bool

class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)

Bases: Pretrained

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts) 
>>> # The sample rate of the reference audio must be greater than or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output) 
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text) 

HPARAMS_NEEDED = ['model']

clone_voice(texts, audio_path)

Generates mel-spectrogram using input text and reference audio

Parameters:

texts (str or list) – Input text
audio_path (str) – Reference audio

Return type:

tensors of output spectrograms, output lengths and alignments
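
The example above notes that the reference audio's sample rate must be at least that of the speaker embedding model. A quick pre-check before calling `clone_voice` could look like the sketch below, which reads the rate from the file header with the standard-library `wave` module (PCM WAV files only). `wav_sample_rate` and `check_reference_audio` are hypothetical helpers, and the default of 16000 Hz is an assumed placeholder, not a value taken from this page.

```python
import wave


def wav_sample_rate(path):
    """Read the sample rate in Hz from a PCM WAV file header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def check_reference_audio(path, min_sample_rate=16000):
    """Return True if the file's sample rate is >= min_sample_rate.

    min_sample_rate=16000 is an assumed placeholder; replace it with the
    sample rate of the speaker embedding model that is actually loaded.
    """
    return wav_sample_rate(path) >= min_sample_rate


# Possible use before cloning a voice (path taken from the example above):
# if check_reference_audio("tests/samples/single-mic/example1.wav"):
#     mel_output, mel_length, alignment = mstacotron2.clone_voice(
#         "Mary had a little lamb.", "tests/samples/single-mic/example1.wav"
#     )
```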

generate_random_voice(texts)

Generates mel-spectrogram using input text and a random speaker voice

Parameters: texts (str or list) – Input text
Return type: tensors of output spectrograms, output lengths and alignments

training: bool

class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

Parameters: hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>>
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 

HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']

encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Computes mel-spectrogram for a list of texts

Parameters:

texts (List[str]) – texts to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, durations, pitch and energy (the four values unpacked in the example above)
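
To compare the effect of the `pace` argument, one option is to run the same text at several pace values and inspect the returned durations side by side. The sketch below relies only on the `encode_text(texts, pace=...)` signature documented here. `synthesize_at_paces` is a hypothetical helper, and the default pace values are arbitrary illustrations rather than recommended settings.

```python
def synthesize_at_paces(tts, text, paces=(0.8, 1.0, 1.2)):
    """Call tts.encode_text once per pace value and collect the results.

    `tts` is any object exposing encode_text(texts, pace=...), for example
    a loaded FastSpeech2 instance. Returns a dict mapping pace -> result.
    """
    return {pace: tts.encode_text([text], pace=pace) for pace in paces}


# Possible use with a loaded model (requires downloading the checkpoint):
# results = synthesize_at_paces(fastspeech2, "Mary had a little lamb.")
# for pace, (mel_outputs, durations, pitch, energy) in results.items():
#     print(pace, mel_outputs.shape, durations.sum().item())
```

The same pattern applies to `pitch_rate` and `energy_rate` by changing which keyword argument is varied.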

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Computes mel-spectrogram for a list of phoneme sequences

Parameters:

phonemes (List[List[str]]) – phonemes to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, durations, pitch and energy
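
`encode_phoneme` takes one list of phoneme strings per utterance (`List[List[str]]`). As a small illustration of that input shape, the sketch below builds such a batch from space-separated phoneme strings. `split_phonemes` is a hypothetical helper, and the ARPAbet-style symbols are an assumption about the phoneme inventory; the symbols a given checkpoint accepts depend on its `input_encoder`.

```python
def split_phonemes(phoneme_strings):
    """Turn space-separated phoneme strings into a List[List[str]] batch."""
    return [s.split() for s in phoneme_strings]


# Assumed ARPAbet-style symbols, for illustration only.
batch = split_phonemes(["HH AH0 L OW1", "W ER1 L D"])
# batch is a list with one inner list of phoneme strings per utterance.
# With a loaded model it could be passed on as:
# outputs = fastspeech2.encode_phoneme(batch)
```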

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Batch inference for a tensor of phoneme sequences

Parameters:

tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Converts a single text to a mel-spectrogram

Parameters:

text (str) – A text to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

training: bool

class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

Parameters: hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 

HPARAMS_NEEDED = ['model', 'input_encoder']

encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Computes mel-spectrogram for a list of texts

Parameters:

texts (List[str]) – texts to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, durations, pitch and energy (the four values unpacked in the example above)

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Computes mel-spectrogram for a list of phoneme sequences

Parameters:

phonemes (List[List[str]]) – phonemes to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, durations, pitch and energy

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Batch inference for a tensor of phoneme sequences

Parameters:

tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies

training: bool

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)

Converts a single text to a mel-spectrogram

Parameters:

text (str) – A text to be converted to spectrogram
pace (float) – pace for the speech synthesis
pitch_rate (float) – scaling factor for phoneme pitches
energy_rate (float) – scaling factor for phoneme energies