Checkpoints#
There are two main ways to load pretrained checkpoints in NeMo, as described in Checkpoints:

1. Using the restore_from() method to load a local checkpoint file (.nemo), or
2. Using the from_pretrained() method to download and set up a checkpoint from NGC.
Note that these instructions are for loading fully trained checkpoints for evaluation or fine-tuning. For resuming an unfinished training experiment, use the Experiment Manager by setting the resume_if_exists flag to True.
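For example, resuming a run could look like the following. This is a sketch assuming a Hydra-style training script (fastpitch.py and the config name are hypothetical placeholders); the exp_manager.resume_if_exists override is what tells the Experiment Manager to pick up the last saved checkpoint.

```shell
# Hypothetical script and config names; only the exp_manager flag matters here.
python fastpitch.py --config-name=fastpitch_config \
    exp_manager.resume_if_exists=True
```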
Local Checkpoints#
Save Model Checkpoints: NeMo automatically saves final model checkpoints with the .nemo suffix. You can also manually save any model checkpoint using model.save_to(<checkpoint_path>.nemo).

Load Model Checkpoints: if you would like to load a checkpoint saved at <path/to/checkpoint/file.nemo>, use the restore_from() method below, where <MODEL_BASE_CLASS> is the TTS model class of the original checkpoint.
```python
import nemo.collections.tts as nemo_tts

model = nemo_tts.models.<MODEL_BASE_CLASS>.restore_from(restore_path="<path/to/checkpoint/file.nemo>")
```
NGC Pretrained Checkpoints#
The NGC NeMo Text to Speech collection aggregates model cards that contain detailed information about checkpoints for various models trained on various datasets. The tables below in Checkpoints list some of the TTS models available from NGC, including speech/text aligners, acoustic models, and vocoders.
Load Model Checkpoints#
The models can be accessed via the from_pretrained() method inside the TTS model class. In general, you can load any of these models with code in the following format:
```python
import nemo.collections.tts as nemo_tts

model = nemo_tts.models.<MODEL_BASE_CLASS>.from_pretrained(model_name="<MODEL_NAME>")
```
where <MODEL_NAME> is the value in the Model Name column in the tables in Checkpoints. These names are predefined in each model's member function self.list_available_models(). For example, the available NGC FastPitch model names can be found as follows:
```python
In [1]: import nemo.collections.tts as nemo_tts

In [2]: nemo_tts.models.FastPitchModel.list_available_models()
Out[2]:
[PretrainedModelInfo(
     pretrained_model_name=tts_en_fastpitch,
     description=This model is trained on LJSpeech sampled at 22050Hz and can be used to generate female English voices with an American accent. It is ARPABET-based.,
     location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo,
     class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
     pretrained_model_name=tts_en_fastpitch_ipa,
     description=This model is trained on LJSpeech sampled at 22050Hz and can be used to generate female English voices with an American accent. It is IPA-based.,
     location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo,
     class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
     pretrained_model_name=tts_en_fastpitch_multispeaker,
     description=This model is trained on HiFiTTS sampled at 44100Hz and can be used to generate male and female English voices with an American accent.,
     location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_multispeaker_fastpitchhifigan/versions/1.10.0/files/tts_en_fastpitch_multispeaker.nemo,
     class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
     pretrained_model_name=tts_de_fastpitch_singlespeaker,
     description=This model is trained on a single male speaker's data in the OpenSLR Neutral German Dataset sampled at 22050Hz and can be used to generate male German voices.,
     location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitchhifigan/versions/1.10.0/files/tts_de_fastpitch_align.nemo,
     class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 ),
 PretrainedModelInfo(
     pretrained_model_name=tts_de_fastpitch_multispeaker_5,
     description=This model is trained on 5 speakers in the HUI-Audio-Corpus-German clean subset sampled at 44100Hz and can be used to generate male and female German voices.,
     location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_de_fastpitch_multispeaker_5/versions/1.11.0/files/tts_de_fastpitch_multispeaker_5.nemo,
     class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
 )]
```
From the key-value pair pretrained_model_name=tts_en_fastpitch above, you can get the model name tts_en_fastpitch and load it by running:
```python
model = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")
```
If you would like to programmatically list the models available for a particular base class, you can use the list_available_models() method:
```python
nemo_tts.models.<MODEL_BASE_CLASS>.list_available_models()
```
Inference and Audio Generation#
NeMo TTS supports both cascaded and end-to-end models for synthesizing audio. Most of the steps in between are the same, except that cascaded models need to load an extra vocoder model before generating audio. The code snippet below demonstrates the steps for generating an audio sample from a text input using cascaded FastPitch and HiFi-GAN models. Please refer to the NeMo TTS Collection API for the detailed implementation of the model classes.
```python
import nemo.collections.tts as nemo_tts

# Load mel spectrogram generator
spec_generator = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")

# Load vocoder
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan")

# Generate audio
import soundfile as sf

parsed = spec_generator.parse("You can type your sentence here to get nemo to produce speech.")
spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# Save the audio to disk in a file called speech.wav
sf.write("speech.wav", audio.to('cpu').numpy(), 22050)
```
Fine-Tuning on Different Datasets#
There are multiple TTS tutorials provided in the tutorials/tts/ directory. Most of these tutorials demonstrate how to instantiate a pretrained model and prepare it for fine-tuning on datasets in the same or a different language, and with the same or different speakers:
- cross-lingual fine-tuning: NVIDIA/NeMo
- cross-speaker fine-tuning: NVIDIA/NeMo
NGC TTS Models#
This section summarizes the full list of available NeMo TTS models that have been released in the NGC NeMo Text to Speech Collection. You can download checkpoints of interest in either of the following ways:
```shell
wget '<CHECKPOINT_URL_IN_THE_TABLE>'
curl -LO '<CHECKPOINT_URL_IN_THE_TABLE>'
```
Speech/Text Aligners#
| Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_radtts_aligner | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.aligner.AlignerModel | | |
| en-US | tts_en_radtts_aligner_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.aligner.AlignerModel | | |
Mel-Spectrogram Generators#
| Locale | Model Name | Dataset | Sampling Rate | #Spk | Symbols | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_fastpitch | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| en-US | tts_en_fastpitch_ipa | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| en-US | tts_en_fastpitch_multispeaker | HiFiTTS | 44100Hz | 10 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| en-US | tts_en_lj_mixertts | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | | |
| en-US | tts_en_lj_mixerttsx | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.mixer_tts.MixerTTSModel | | |
| en-US | RAD-TTS | TBD | TBD | TBD | ARPABET | nemo.collections.tts.models.radtts.RadTTSModel | TBD | |
| en-US | tts_en_tacotron2 | LJSpeech | 22050Hz | 1 | ARPABET | nemo.collections.tts.models.tacotron2.Tacotron2Model | | |
| de-DE | tts_de_fastpitch_multispeaker_5 | HUI Audio Corpus German | 44100Hz | 5 | ARPABET | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2102 | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| de-DE | tts_de_fastpitch_singleSpeaker_thorstenNeutral_2210 | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | Graphemes | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| es | tts_es_fastpitch_multispeaker | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | IPA | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
| zh-CN | tts_zh_fastpitch_sfspeech | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | pinyin | nemo.collections.tts.models.fastpitch.FastPitchModel | | |
Vocoders#
| Locale | Model Name | Spectrogram Generator | Dataset | Sampling Rate | #Spk | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_hifigan | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| en-US | tts_en_lj_hifigan_ft_mixertts | Mixer-TTS | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| en-US | tts_en_lj_hifigan_ft_mixerttsx | Mixer-TTS-X | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| en-US | tts_en_hifitts_hifigan_ft_fastpitch | FastPitch | HiFiTTS | 44100Hz | 10 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| en-US | tts_en_lj_univnet | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | | |
| en-US | tts_en_libritts_univnet | librosa.filters.mel | LibriTTS | 24000Hz | 1 | nemo.collections.tts.models.univnet.UnivNetModel | | |
| en-US | tts_en_waveglow_88m | librosa.filters.mel | LJSpeech | 22050Hz | 1 | nemo.collections.tts.models.waveglow.WaveGlowModel | | |
| de-DE | tts_de_hui_hifigan_ft_fastpitch_multispeaker_5 | FastPitch | HUI Audio Corpus German | 44100Hz | 5 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2102 | FastPitch | Thorsten Müller Neutral 21.02 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| de-DE | tts_de_hifigan_singleSpeaker_thorstenNeutral_2210 | FastPitch | Thorsten Müller Neutral 22.10 dataset | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| es | tts_es_hifigan_ft_fastpitch_multispeaker | FastPitch | OpenSLR crowdsourced Latin American Spanish | 44100Hz | 174 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
| zh-CN | tts_zh_hifigan_sfspeech | FastPitch | SFSpeech Chinese/English Bilingual Speech | 22050Hz | 1 | nemo.collections.tts.models.hifigan.HifiGanModel | | |
End2End models#
| Locale | Model Name | Dataset | Sampling Rate | #Spk | Phoneme Unit | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| en-US | tts_en_lj_vits | LJSpeech | 22050Hz | 1 | IPA | nemo.collections.tts.models.vits.VitsModel | | |
| en-US | tts_en_hifitts_vits | HiFiTTS | 44100Hz | 10 | IPA | nemo.collections.tts.models.vits.VitsModel | | |
Codec models#
| Model Name | Dataset | Sampling Rate | Model Class | Overview | Checkpoint |
|---|---|---|---|---|---|
| audio_codec_16khz_small | Libri-Light | 16000Hz | nemo.collections.tts.models.AudioCodecModel | | |
| mel_codec_22khz_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
| mel_codec_44khz_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | | |
| mel_codec_22khz_fullband_medium | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
| mel_codec_44khz_fullband_medium | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | | |
| nvidia/low-frame-rate-speech-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
| nvidia/audio-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
| nvidia/audio-codec-44khz | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | | |
| nvidia/mel-codec-22khz | LibriVox and Common Voice | 22050Hz | nemo.collections.tts.models.AudioCodecModel | | |
| nvidia/mel-codec-44khz | LibriVox and Common Voice | 44100Hz | nemo.collections.tts.models.AudioCodecModel | | |