Checkpoints#

There are two main ways to load pretrained checkpoints in NeMo, as described in Checkpoints:

  • Using the restore_from() method to load a local checkpoint file (.nemo), or

  • Using the from_pretrained() method to download and set up a checkpoint from NGC.

Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. To resume an unfinished training experiment, use the Experiment Manager and set the resume_if_exists flag to True.
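As a rough illustration of the resume setting described above, the sketch below builds an Experiment Manager configuration with resume_if_exists enabled. The exp_dir and name values are placeholders, resume_ignore_no_checkpoint is an extra flag included on the assumption that the very first run has no checkpoint to resume from, and attach_exp_manager is a hypothetical helper that imports NeMo lazily so the configuration can be inspected without NeMo installed.

```python
def build_resume_config(exp_dir, name):
    """Return an Experiment Manager config dict that resumes an unfinished run."""
    return {
        "exp_dir": exp_dir,                   # placeholder output directory
        "name": name,                         # placeholder experiment name
        "resume_if_exists": True,             # continue from the latest checkpoint if present
        "resume_ignore_no_checkpoint": True,  # assumption: start fresh when none exists
    }


def attach_exp_manager(trainer, config):
    """Hypothetical helper: pass the config to NeMo's Experiment Manager."""
    # Imported lazily because these packages are only needed at training time.
    from omegaconf import OmegaConf
    from nemo.utils.exp_manager import exp_manager

    return exp_manager(trainer, OmegaConf.create(config))


config = build_resume_config("./nemo_experiments", "FastPitch")
```

In a training script, attach_exp_manager(trainer, config) would be called before trainer.fit(model), so that a rerun of the same script picks up where the previous run stopped.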

Local Checkpoints#

  • Save Model Checkpoints: NeMo automatically saves final model checkpoints with the .nemo suffix. You can also manually save any model checkpoint using model.save_to(<checkpoint_path>.nemo).

  • Load Model Checkpoints: to load a checkpoint saved at <path/to/checkpoint/file.nemo>, use the restore_from() method below, where <MODEL_BASE_CLASS> is the TTS model class of the original checkpoint.

import nemo.collections.tts as nemo_tts
model = nemo_tts.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
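The restore step above can be wrapped in a small helper that validates the path before NeMo is imported. This is only a sketch: restore_tts_model is a hypothetical name, and looking the class up by name on nemo_tts.models is an illustrative choice rather than something NeMo requires.

```python
from pathlib import Path


def restore_tts_model(model_class_name, checkpoint_path):
    """Restore a TTS model of the given class from a local .nemo file."""
    path = Path(checkpoint_path)
    if path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo checkpoint, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {path}")

    # Imported lazily so the path checks above fail fast without loading NeMo.
    import nemo.collections.tts as nemo_tts

    model_class = getattr(nemo_tts.models, model_class_name)
    return model_class.restore_from(restore_path=str(path))


# Example call (requires NeMo and an existing checkpoint file):
# model = restore_tts_model("FastPitchModel", "<path/to/checkpoint/file.nemo>")
```

Checking the suffix and existence first gives a clearer error than letting the loader fail partway through unpacking the archive.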

NGC Pretrained Checkpoints#

The NGC NeMo Text to Speech collection aggregates model cards that contain detailed information about checkpoints of various models trained on various datasets. The tables in Checkpoints below list some of the available TTS models from NGC, including speech/text aligners, acoustic models, and vocoders.

Load Model Checkpoints#

The models can be accessed via the from_pretrained() method of the TTS model class. In general, you can load any of these models with code in the following format:

import nemo.collections.tts as nemo_tts
model = nemo_tts.models.<MODEL_BASE_CLASS>.from_pretrained(model_name="<MODEL_NAME>")

where <MODEL_NAME> is the value in the Model Name column of the tables in Checkpoints. These names are predefined in each model's class method list_available_models(). For example, the available NGC FastPitch model names can be listed as follows:

In [1]: import nemo.collections.tts as nemo_tts
In [2]: nemo_tts.models.FastPitchModel.list_available_models()
Out[2]:
[PretrainedModelInfo(
    pretrained_model_name=tts_en_fastpitch,
    description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is ARPABET-based.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_en_fastpitch_ipa,
    description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is IPA-based.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_en_fastpitch_multispeaker,
    description=This model is trained on HiFITTS sampled at 44100Hz with and can be used to generate male and female English voices with an American accent.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_fastpitch_multispeaker.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_de_fastpitch_singlespeaker,
    description=This model is trained on a single male speaker data in OpenSLR Neutral German Dataset sampled at 22050Hz and can be used to generate male German voices.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.10.0/files/tts_de_fastpitch_align.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
    pretrained_model_name=tts_de_fastpitch_multispeaker_5,
    description=This model is trained on 5 speakers in HUI-Audio-Corpus-German clean subset sampled at 44100Hz with and can be used to generate male and female German voices.,
    location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_fastpitch_multispeaker_5.nemo,
    class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 )]

From the key-value pair pretrained_model_name=tts_en_fastpitch above, you can take the model name tts_en_fastpitch and load it by running:

model = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")

If you would like to programmatically list the models available for a particular base class, you can use the list_available_models() method:

nemo_tts.models.<MODEL_BASE_CLASS>.list_available_models()

Inference and Audio Generation#

NeMo TTS supports both cascaded and end-to-end models for synthesizing audio. Most of the steps are the same, except that cascaded models need to load an extra vocoder model before generating audio. The code snippet below demonstrates the steps for generating an audio sample from a text input using a cascaded pipeline of FastPitch and HiFiGAN models. Please refer to the NeMo TTS Collection API for the detailed implementation of the model classes.

import nemo.collections.tts as nemo_tts
import soundfile as sf

# Load mel spectrogram generator
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
# Load vocoder
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan")

# Generate audio
parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
# (detach from the graph and drop the batch dimension before writing)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 22050)

Fine-Tuning on Different Datasets#

Multiple TTS tutorials are provided in the tutorials/tts/ directory. Most of these tutorials demonstrate how to instantiate a pre-trained model and prepare it for fine-tuning on datasets in the same or a different language, with the same or different speakers.

NGC TTS Models#

This section lists the available NeMo TTS models that have been released in the NGC NeMo Text to Speech Collection. You can download the model checkpoints you are interested in using either of the commands below:

  • wget '<CHECKPOINT_URL_IN_THE_TABLE>'

  • curl -LO '<CHECKPOINT_URL_IN_THE_TABLE>'
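The same download can be done from Python with the standard library. The sketch below is an illustrative alternative to the commands above: checkpoint_filename and download_checkpoint are hypothetical helper names, and the URL is the tts_en_fastpitch entry from the tables in this section. Only the filename is computed at import time, so no network access happens until download_checkpoint is called.

```python
import os
import urllib.request
from urllib.parse import urlparse


def checkpoint_filename(url):
    """Return the file name at the end of a checkpoint URL."""
    return os.path.basename(urlparse(url).path)


def download_checkpoint(url, dest_dir="."):
    """Download the checkpoint into dest_dir unless it is already there."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, checkpoint_filename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


url = (
    "https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch"
    "/versions/1.8.1/files/tts_en_fastpitch_align.nemo"
)
filename = checkpoint_filename(url)  # "tts_en_fastpitch_align.nemo"
```

The path returned by download_checkpoint(url) can then be passed as restore_path to the restore_from() method of the matching model class.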

Speech/Text Aligners#

| Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en-US | tts_en_radtts_aligner | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.aligner.AlignerModel | tts_en_radtts_aligner | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_radtts_aligner/versions/ARPABET_1.11.0/files/Aligner.nemo |
| en-US | tts_en_radtts_aligner_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.aligner.AlignerModel | tts_en_radtts_aligner | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_radtts_aligner/versions/IPA_1.13.0/files/Aligner.nemo |

Mel-Spectrogram Generators#

| Locale | Model Name | Dataset | Sampling Rate | #Spk | Symbols | Model Class | Overview | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en-US | tts_en_fastpitch | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_en_fastpitch | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo |
| en-US | tts_en_fastpitch_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_en_fastpitch | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo |
| en-US | tts_en_fastpitch_multispeaker | HiFiTTS | 44100Hz | 10 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_en_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_fastpitch_multispeaker.nemo |
| en-US | tts_en_lj_mixertts | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | tts_en_lj_mixertts | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_mixertts/versions/1.6.0/files/tts_en_lj_mixertts.nemo |
| en-US | tts_en_lj_mixerttsx | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | tts_en_lj_mixerttsx | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_mixerttsx/versions/1.6.0/files/tts_en_lj_mixerttsx.nemo |
| en-US | RAD-TTS | TBD | TBD | TBD | ARPABET | nemo.collections.tts.models.radtts.RadTTSModel | TBD | |
| en-US | tts_en_tacotron2 | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.tacotron2.Tacotron2Model | tts_en_tacotron2 | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_tacotron2/versions/1.10.0/files/tts_en_tacotron2.nemo |
| de-DE | tts_de_fastpitch_multispeaker_5 | HUI Audio Corpus German | 44100Hz | 5 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_de_fastpitch_multispeaker_5 | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_fastpitch_multispeaker_5.nemo |
| de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2102 | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_fastpitch_thorstens2102.nemo |
| de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2210 | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_fastpitch_thorstens2210.nemo |
| es | tts_es_fastpitch_multispeaker | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_es_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_es_multispeaker_fastpitchhifigan/versions/1.15.0/files/tts_es_fastpitch_multispeaker.nemo |
| zh-CN | tts_zh_fastpitch_sfspeech | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | pinyin | nemo.collections.tts.models.fastpitch.FastPitchModel | tts_zh_fastpitch_hifigan_sfspeech | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_zh_fastpitch_hifigan_sfspeech/versions/1.15.0/files/tts_zh_fastpitch_sfspeech.nemo |

Vocoders#

| Locale | Model Name | Spectrogram Generator | Dataset | Sampling Rate | #Spk | Model Class | Overview | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en-US | tts_en_hifigan | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_hifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo |
| en-US | tts_en_lj_hifigan_ft_mixertts | Mixer-TTS | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_lj_hifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_hifigan/versions/1.6.0/files/tts_en_lj_hifigan_ft_mixertts.nemo |
| en-US | tts_en_lj_hifigan_ft_mixerttsx | Mixer-TTS-X | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_lj_hifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_hifigan/versions/1.6.0/files/tts_en_lj_hifigan_ft_mixerttsx.nemo |
| en-US | tts_en_hifitts_hifigan_ft_fastpitch | FastPitch | HiFiTTS | 44100Hz | 10 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_en_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_hifitts_hifigan_ft_fastpitch.nemo |
| en-US | tts_en_lj_univnet | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | tts_en_lj_univnet | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_univnet/versions/1.7.0/files/tts_en_lj_univnet.nemo |
| en-US | tts_en_libritts_univnet | librosa.filters.mel | LibriTTS | 24000Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | tts_en_libritts_univnet | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_libritts_univnet/versions/1.7.0/files/tts_en_libritts_multispeaker_univnet.nemo |
| en-US | tts_en_waveglow_88m | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.waveglow.WaveGlowModel | tts_en_waveglow_88m | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_waveglow_88m/versions/1.0.0/files/tts_waveglow.nemo |
| de-DE | tts_de_hui_hifigan_ft_fastpitch_multispeaker_5 | FastPitch | HUI Audio Corpus German | 44100Hz | 5 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_de_fastpitch_multispeaker_5 | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_hui_hifigan_ft_fastpitch_multispeaker_5.nemo |
| de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2102 | FastPitch | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_hifigan_thorstens2102.nemo |
| de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2210 | FastPitch | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_de_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.15.0/files/tts_de_hifigan_thorstens2210.nemo |
| es | tts_es_hifigan_ft_fastpitch_multispeaker | FastPitch | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_es_multispeaker_fastpitchhifigan | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_es_multispeaker_fastpitchhifigan/versions/1.15.0/files/tts_es_hifigan_ft_fastpitch_multispeaker.nemo |
| zh-CN | tts_zh_hifigan_sfspeech | FastPitch | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | tts_zh_fastpitch_hifigan_sfspeech | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_zh_fastpitch_hifigan_sfspeech/versions/1.15.0/files/tts_zh_hifigan_sfspeech.nemo |

End2End models#

| Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en-US | tts_en_lj_vits | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.vits.VitsModel | tts_en_lj_vits | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_lj_vits/versions/1.13.0/files/vits_ljspeech_fp16_full.nemo |
| en-US | tts_en_hifitts_vits | HiFiTTS | 44100Hz | 10 | IPA | nemo.collections.tts.models.vits.VitsModel | tts_en_hifitts_vits | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_hifitts_vits/versions/r1.15.0/files/vits_en_hifitts.nemo |

Codec models#

| Model Name | Dataset | Sampling Rate | Model Class | Overview | Checkpoint |
| --- | --- | --- | --- | --- | --- |
| audio_codec_16khz_small | Libri-Light | 16000Hz | nemo.collections.tts.models.AudioCodecModel | audio_codec_16khz_small | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/audio_codec_16khz_small/versions/v1/files/audio_codec_16khz_small.nemo |
| mel_codec_22khz_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_22khz_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_22khz_medium/versions/v1/files/mel_codec_22khz_medium.nemo |
| mel_codec_44khz_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_44khz_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_44khz_medium/versions/v1/files/mel_codec_44khz_medium.nemo |
| mel_codec_22khz_fullband_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_22khz_fullband_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_22khz_fullband_medium/versions/v1/files/mel_codec_22khz_fullband_medium.nemo |
| mel_codec_44khz_fullband_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | mel_codec_44khz_fullband_medium | https://api.ngc.nvidia.com/v2/models/nvidia/nemo/mel_codec_44khz_fullband_medium/versions/v1/files/mel_codec_44khz_fullband_medium.nemo |
| nvidia/low-frame-rate-speech-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | audio_codec_low_frame_rate_22khz | https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz/resolve/main/low-frame-rate-speech-codec-22khz.nemo |
| nvidia/audio-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | audio-codec-22khz | https://huggingface.co/nvidia/audio-codec-22khz/resolve/main/audio-codec-22khz.nemo |
| nvidia/audio-codec-44khz | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | audio-codec-44khz | https://huggingface.co/nvidia/audio-codec-44khz/resolve/main/audio-codec-44khz.nemo |
| nvidia/mel-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | mel-codec-22khz | https://huggingface.co/nvidia/mel-codec-22khz/resolve/main/mel-codec-22khz.nemo |
| nvidia/mel-codec-44khz | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | mel-codec-44khz | https://huggingface.co/nvidia/mel-codec-44khz/resolve/main/mel-codec-44khz.nemo |