Speech AI Models
NVIDIA NeMo Framework supports the training and customization of Speech AI models, specifically designed to enable voice-based interfaces for conversational AI applications. A range of speech tasks are supported, including Automatic Speech Recognition (ASR), Speaker Diarization, and Text-to-Speech (TTS), which we highlight below.
Automatic Speech Recognition (ASR)
Automatic Speech Recognition is the task of generating transcriptions of what was spoken in an audio file.
| Model family | Decoder type | Useful links |
|---|---|---|
| Canary | AED (Attention-based Encoder-Decoder) | |
| Parakeet | | |
Key features of NeMo ASR include:
- [Pretrained ASR models](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/results.html#automatic-speech-recognition-models), many topping the HuggingFace Open ASR Leaderboard
- Model checkpoints specialized for real-time speech recognition

Find more details in the Developer Docs.
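ASR leaderboards typically rank models by word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The following is a minimal sketch of that metric in plain Python, not NeMo's own implementation. The function name `word_error_rate` is hypothetical, and the sketch assumes both transcripts are already normalized (casing, punctuation) and tokenized by whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("hello world", "hello there world")` counts one inserted word against two reference words. Note that WER can exceed 1.0 when the hypothesis contains many insertions.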
Speaker Diarization
Speaker diarization is the process of partitioning an audio stream into segments based on the identity of each speaker. Essentially, it answers the question, “Who spoke when?”
| Model name | Useful links |
|---|---|
| MSDD (Multiscale Diarization Decoder) | |
Find more details in the Developer Docs.
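To make the "who spoke when" idea concrete, the sketch below represents a diarization result as a list of speaker-labeled time segments and queries it. It illustrates the output data structure only and performs no diarization itself. The names `Segment`, `speakers_at`, and `to_rttm` are hypothetical, and the serialized lines follow the RTTM layout commonly used for diarization output, which is assumed here rather than taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    speaker: str   # speaker label, e.g. "speaker_0"


def speakers_at(segments: list[Segment], t: float) -> list[str]:
    """Return the sorted labels of all speakers active at time t.

    More than one label is returned when speech overlaps.
    """
    return sorted({s.speaker for s in segments if s.start <= t < s.end})


def to_rttm(segments: list[Segment], file_id: str) -> list[str]:
    """Serialize segments as RTTM-style lines (onset and duration in seconds)."""
    lines = []
    for s in sorted(segments, key=lambda seg: seg.start):
        duration = s.end - s.start
        lines.append(
            f"SPEAKER {file_id} 1 {s.start:.3f} {duration:.3f} "
            f"<NA> <NA> {s.speaker} <NA> <NA>"
        )
    return lines
```

With segments `Segment(0.0, 2.5, "speaker_0")` and `Segment(2.0, 4.0, "speaker_1")`, calling `speakers_at(segments, 2.2)` returns both labels because the two speakers overlap between 2.0 and 2.5 seconds.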
Text-to-Speech (TTS)
Text-to-Speech is a technology that converts textual inputs into natural human speech.
Find more details in the Developer Docs.
Speech AI Tools
NeMo Framework also includes a large set of Speech AI tools for dataset preparation, model evaluation, and text normalization.
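As an illustration of the dataset preparation step, the sketch below builds a manifest in JSON Lines form, with one utterance per line. It assumes the `audio_filepath`, `duration`, and `text` fields commonly used for NeMo ASR manifests; the helper names `build_manifest_lines` and `write_manifest` are hypothetical and are not part of the NeMo tools mentioned above.

```python
import json


def build_manifest_lines(entries):
    """Turn (audio_filepath, duration_seconds, text) tuples into JSON lines."""
    lines = []
    for audio_filepath, duration, text in entries:
        record = {
            "audio_filepath": audio_filepath,  # path to the audio file
            "duration": duration,              # length in seconds
            "text": text,                      # reference transcript
        }
        # ensure_ascii=False keeps non-English transcripts human-readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


def write_manifest(path, entries):
    """Write one JSON object per line to the manifest file at `path`."""
    with open(path, "w", encoding="utf-8") as f:
        for line in build_manifest_lines(entries):
            f.write(line + "\n")
```

Keeping each utterance on its own line lets large datasets be streamed, filtered, or split with ordinary line-oriented tools.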