Speech AI Models #

NVIDIA NeMo Framework supports the training and customization of Speech AI models, specifically designed to enable voice-based interfaces for conversational AI applications. A range of speech tasks are supported, including Automatic Speech Recognition (ASR), Speaker Diarization, and Text-to-Speech (TTS), which we highlight below.

Automatic Speech Recognition (ASR)#

Automatic Speech Recognition is the task of generating transcriptions of what was spoken in an audio file.

Latest ASR Models Developed by the NVIDIA NeMo Team#
Model family	Decoder type	Useful links
Canary	AED (Attention-based Encoder-Decoder)	Docs, Paper, HF space
Parakeet	CTC, RNN-T, TDT, TDT-CTC hybrid	Docs, HF space
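To make the "CTC" decoder type in the table more concrete, here is a minimal sketch of greedy CTC decoding: take the most likely label per frame, collapse consecutive repeats, then drop the blank symbol. The toy vocabulary, the blank index, and the function name `ctc_greedy_decode` are illustrative assumptions, not part of the NeMo API.

```python
from itertools import groupby
from typing import Sequence


def ctc_greedy_decode(frame_ids: Sequence[int], vocab: Sequence[str], blank_id: int) -> str:
    """Collapse consecutive repeated frame labels, then remove blanks."""
    # groupby yields one key per run of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Hypothetical toy setup: three characters plus a blank at index 3.
vocab = ["a", "c", "t"]
BLANK = 3

# Per-frame argmax labels, i.e. "c c _ a a _ _ t".
frames = [1, 1, BLANK, 0, 0, BLANK, BLANK, 2]
print(ctc_greedy_decode(frames, vocab, BLANK))
```

Note that the blank is what allows a genuinely doubled character to survive the collapse step: the frames `t _ t` decode to "tt", while `t t` decode to "t".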

Key features of NeMo ASR include:

Pretrained ASR models (https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/results.html#automatic-speech-recognition-models), many topping the HuggingFace Open ASR Leaderboard
Model checkpoints specialized for real-time speech recognition
LM decoding
Keyword spotting

Find more details in the Developer Docs.

Speaker Diarization#

Speaker diarization is the process of partitioning an audio stream into segments based on the identity of each speaker. Essentially, it answers the question, “Who spoke when?”
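As an illustration of what a "who spoke when" answer looks like as data, the sketch below turns per-frame speaker labels into time-stamped segments. It is a simplification that assumes at most one active speaker per frame (real systems may also need to represent overlapping speech). The label format, the frame duration, and the function name `frames_to_segments` are assumptions made for this example.

```python
from itertools import groupby
from typing import List, Optional, Sequence, Tuple

# (start_seconds, end_seconds, speaker_label)
Segment = Tuple[float, float, str]


def frames_to_segments(labels: Sequence[Optional[str]], frame_sec: float) -> List[Segment]:
    """Merge runs of identical per-frame speaker labels into segments.

    A label of None marks a frame with no speech and produces no segment.
    """
    segments: List[Segment] = []
    index = 0  # index of the first frame in the current run
    for speaker, run in groupby(labels):
        length = sum(1 for _ in run)
        if speaker is not None:
            start = round(index * frame_sec, 3)
            end = round((index + length) * frame_sec, 3)
            segments.append((start, end, speaker))
        index += length
    return segments


# Hypothetical frame-level output with 0.5-second frames.
labels = ["A", "A", "A", None, "B", "B", "A"]
for start, end, speaker in frames_to_segments(labels, frame_sec=0.5):
    print(f"{start:.1f}s - {end:.1f}s: speaker {speaker}")
```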

Latest Speaker Diarization Models Developed by the NVIDIA NeMo Team#
Model name	Useful links
MSDD (Multiscale Diarization Decoder)	Docs, Paper, HF space

Find more details in the Developer Docs.

Text-To-Speech (TTS)#

Text-to-Speech is a technology that converts textual inputs into natural human speech.

Latest TTS Models Developed by the NVIDIA NeMo Team#
Model name	Useful links
T5-TTS	Paper, Blog post

Find more details in the Developer Docs.

Speech AI Tools#

NeMo Framework also includes a large set of Speech AI tools for dataset preparation, model evaluation, and text normalization.
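For readers new to ASR model evaluation, a common metric is word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The following is a minimal standalone sketch of that calculation, not the implementation used by the NeMo tools. The function name `word_error_rate` and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

In practice, transcripts are usually normalized (casing, punctuation, number formatting) before scoring, since this sketch would otherwise count such differences as errors.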
