speechbrain.inference.TTS module

Specifies the inference interfaces for Text-To-Speech (TTS) modules.

Authors:
  • Aku Rouhe 2021

  • Peter Plantinga 2021

  • Loren Lugosch 2020

  • Mirco Ravanelli 2020

  • Titouan Parcollet 2021

  • Abdel Heba 2021

  • Andreas Nautsch 2022, 2023

  • Pooneh Mousavi 2023

  • Sylvain de Langen 2023

  • Adel Moumen 2023

  • Pradnya Kandarkar 2023

Summary

Classes:

FastSpeech2

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

FastSpeech2InternalAlignment

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

MSTacotron2

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2.

Tacotron2

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Reference

class speechbrain.inference.TTS.Tacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Tacotron2 (text -> mel_spec).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> tacotron2 = Tacotron2.from_hparams(source="speechbrain/tts-tacotron2-ljspeech", savedir=tmpdir_tts)
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, mel_lengths, alignments = tacotron2.encode_batch(items)
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder)
>>> # Running the TTS
>>> mel_output, mel_length, alignment = tacotron2.encode_text("Mary had a little lamb")
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output)
HPARAMS_NEEDED = ['model', 'text_to_sequence']
text_to_seq(txt)[source]

Encodes raw text into a tensor with a custom text-to-sequence function.

encode_batch(texts)[source]

Computes mel-spectrogram for a list of texts

Texts must be sorted in decreasing order of length.

Parameters:

texts (List[str]) – texts to be encoded into spectrogram

Return type:

tensors of output spectrograms, output lengths and alignments
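The sorting requirement of encode_batch can be met with a small helper that orders the texts by decreasing length and remembers their original positions, so that outputs can be mapped back afterwards. This is a minimal sketch: `sort_by_length_desc` is a hypothetical helper, not part of SpeechBrain, and it measures length in characters, which is an assumption about how the text-to-sequence step counts length.

```python
def sort_by_length_desc(texts):
    """Return (sorted_texts, original_indices), longest text first.

    Hypothetical helper, not part of SpeechBrain. Length is measured in
    characters, which is an assumption about the text-to-sequence step.
    """
    # sorted() is stable, so texts of equal length keep their input order.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)
    return [texts[i] for i in order], order


items = [
    "Never odd or even",
    "A quick brown fox jumped over the lazy dog",
    "How much wood would a woodchuck chuck?",
]
sorted_items, order = sort_by_length_desc(items)
# sorted_items is ready to be passed to encode_batch; order[k] is the
# position in `items` of the k-th sorted text.
```

Keeping `order` around makes it possible to return the spectrograms to the caller in the original input order.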

encode_text(text)[source]

Runs inference for a single text string.

forward(texts)[source]

Encodes the input texts.

training: bool
class speechbrain.inference.TTS.MSTacotron2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for Zero-Shot Multi-Speaker Tacotron2. For voice cloning: (text, reference_audio) -> (mel_spec). For generating a random speaker voice: (text) -> (mel_spec).

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> mstacotron2 = MSTacotron2.from_hparams(source="speechbrain/tts-mstacotron2-libritts", savedir=tmpdir_tts) 
>>> # Sample rate of the reference audio must be greater than or equal to the sample rate of the speaker embedding model
>>> reference_audio_path = "tests/samples/single-mic/example1.wav"
>>> input_text = "Mary had a little lamb."
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-libritts-22050Hz", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_output, mel_length, alignment = mstacotron2.clone_voice(input_text, reference_audio_path) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_output) 
>>> # For generating a random speaker voice, use the following
>>> mel_output, mel_length, alignment = mstacotron2.generate_random_voice(input_text) 
HPARAMS_NEEDED = ['model']
clone_voice(texts, audio_path)[source]

Generates mel-spectrogram using input text and reference audio

Parameters:
  • texts (str or list) – Input text

  • audio_path (str) – Reference audio

Return type:

tensors of output spectrograms, output lengths and alignments
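The example for this class notes that the reference audio's sample rate must be at least that of the speaker embedding model. One way to check this before calling clone_voice is to read the rate from the WAV header with Python's standard `wave` module. The sketch below is illustrative only: the helper names and the `REQUIRED_RATE_HZ` value are assumptions, and the `wave` module handles PCM WAV files only.

```python
import os
import tempfile
import wave

# Hypothetical threshold; look up the real sample rate of the speaker
# embedding model you are using.
REQUIRED_RATE_HZ = 16000


def wav_sample_rate(path):
    """Read the sample rate (Hz) from the header of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_reference_rate_ok(path, required_rate):
    """Return True if the file's sample rate is at least `required_rate`."""
    return wav_sample_rate(path) >= required_rate


# Demo: write one second of 16-bit mono silence at 22050 Hz and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(22050)
    wav_file.writeframes(b"\x00\x00" * 22050)

rate_ok = is_reference_rate_ok(demo_path, REQUIRED_RATE_HZ)
```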

generate_random_voice(texts)[source]

Generates mel-spectrogram using input text and a random speaker voice

Parameters:

texts (str or list) – Input text

Return type:

tensors of output spectrograms, output lengths and alignments

training: bool
class speechbrain.inference.TTS.FastSpeech2(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 (text -> mel_spec).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2.from_hparams(source="speechbrain/tts-fastspeech2-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>>
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 
HPARAMS_NEEDED = ['spn_predictor', 'model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, predicted durations, pitch and energy

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, predicted durations, pitch and energy

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies
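The name tokens_padded suggests that variable-length sequences of encoded phonemes are padded to a common length before batching. The sketch below shows one plain-Python way to do this. `pad_token_sequences` is a hypothetical helper, the token ids are made up for illustration, and the pad value of 0 is an assumption rather than a documented requirement.

```python
def pad_token_sequences(sequences, pad_value=0):
    """Right-pad integer token sequences to the length of the longest one.

    Hypothetical helper. The pad value of 0 is an assumption; use whatever
    index the model's input encoder reserves for padding.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Illustrative token ids only; real ids come from the model's input encoder.
encoded = [[5, 12, 7], [3, 9], [8]]
padded = pad_token_sequences(encoded)
```

A nested list like `padded` could then be converted with `torch.tensor(padded, dtype=torch.long)`; the conversion is left out here to keep the sketch dependency-free.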

forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes the mel-spectrogram for a single input text

Parameters:
  • text (str) – A text to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

training: bool
class speechbrain.inference.TTS.FastSpeech2InternalAlignment(*args, **kwargs)[source]

Bases: Pretrained

A ready-to-use wrapper for FastSpeech2 with internal alignment (text -> mel_spec).

Parameters:

hparams – Hyperparameters (from HyperPyYAML)

Example

>>> tmpdir_tts = getfixture('tmpdir') / "tts"
>>> fastspeech2 = FastSpeech2InternalAlignment.from_hparams(source="speechbrain/tts-fastspeech2-internal-alignment-ljspeech", savedir=tmpdir_tts) 
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> items = [
...   "A quick brown fox jumped over the lazy dog",
...   "How much wood would a woodchuck chuck?",
...   "Never odd or even"
... ]
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(items) 
>>> # One can combine the TTS model with a vocoder (that generates the final waveform)
>>> # Initialize the vocoder (HiFi-GAN)
>>> tmpdir_vocoder = getfixture('tmpdir') / "vocoder"
>>> from speechbrain.inference.vocoders import HIFIGAN
>>> hifi_gan = HIFIGAN.from_hparams(source="speechbrain/tts-hifigan-ljspeech", savedir=tmpdir_vocoder) 
>>> # Running the TTS
>>> mel_outputs, durations, pitch, energy = fastspeech2.encode_text(["Mary had a little lamb."]) 
>>> # Running Vocoder (spectrogram-to-waveform)
>>> waveforms = hifi_gan.decode_batch(mel_outputs) 
HPARAMS_NEEDED = ['model', 'input_encoder']
encode_text(texts, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of texts

Parameters:
  • texts (List[str]) – texts to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, predicted durations, pitch and energy

encode_phoneme(phonemes, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes mel-spectrogram for a list of phoneme sequences

Parameters:
  • phonemes (List[List[str]]) – phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

Return type:

tensors of output spectrograms, predicted durations, pitch and energy

encode_batch(tokens_padded, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Batch inference for a tensor of phoneme sequences

Parameters:
  • tokens_padded (torch.Tensor) – A sequence of encoded phonemes to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies

training: bool
forward(text, pace=1.0, pitch_rate=1.0, energy_rate=1.0)[source]

Computes the mel-spectrogram for a single input text

Parameters:
  • text (str) – A text to be converted to spectrogram

  • pace (float) – pace for the speech synthesis

  • pitch_rate (float) – scaling factor for phoneme pitches

  • energy_rate (float) – scaling factor for phoneme energies