# Ultravox

A fast multimodal LLM for real-time voice
## Latest News
- 2025/02 — Ultravox 0.5 available
- 2024/11 — Ultravox 0.4.1 available
- 2024/08 — Ultravox 0.4 available
- 2024/08 — Ultravox 0.3 available
- 2024/08 — Preview of Ultravox APIs available, more information here
## Key Links
- Ultravox Realtime — Build real-time Voice AI agents on top of the Ultravox model
- Hugging Face — Our Hugging Face page
Ultravox is a new kind of multimodal LLM that can understand text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. Building on research like AudioLM, SeamlessM4T, Gazelle, SpeechGPT, and others, Ultravox is able to extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. We've trained versions on Llama 3, Mistral, and Gemma. This direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future, this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.
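For intuition only, here is a minimal sketch of the projector idea described above, assuming a Whisper-style encoder feature size and a Llama-style hidden size; the layer sizes, activation, and names are illustrative rather than the actual Ultravox implementation.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Toy projector: maps frame-level audio features into the LLM embedding space."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        # Assumed dimensions: 1280 (Whisper-style encoder) -> 4096 (Llama-style hidden size).
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_frames, audio_dim) -> (batch, num_frames, llm_dim)
        return self.proj(audio_features)

# Stand-in for the output of a frozen audio encoder; the real pipeline would then
# interleave these projected frames with the text token embeddings fed to the LLM.
audio_features = torch.randn(1, 300, 1280)
print(AudioProjector()(audio_features).shape)  # torch.Size([1, 300, 4096])
```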
Ultravox currently takes in audio and emits streaming text. As we evolve the model, we'll train it to be able to emit a stream of speech tokens that can then be converted directly into raw audio by an appropriate unit vocoder.
Our default model is built on top of Llama 3.3 70B. We also have an 8B variant available on Hugging Face.
Ultravox can be trained against any open-weight model. See below for more details on training.
See Ultravox in action on our demo page. You can build your own voice-to-voice agents on our Realtime platform at ultravox.ai.

Join us on our Discord server here.

If you're interested in working on Ultravox full-time, we're hiring! Check out our jobs page here.

You can try out Ultravox using your own audio content (as a WAV file) by spinning up an Ultravox instance on our partner, BaseTen: https://www.baseten.co/library/ultravox/. They offer free credits to get started.

If you're interested in running Ultravox in a real-time capacity, we offer a set of managed APIs as well. You can learn more about getting access to those here.

You can download the latest weights from the Ultravox Hugging Face page.
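If you just want to run the released weights locally, the Hugging Face model cards show a `transformers` pipeline pattern roughly like the sketch below. The model id, input keys, and prompt here are assumptions, so check the model card of the variant you download for the current interface.

```python
import librosa
import transformers

# Model id is an assumption; pick a variant from the Ultravox Hugging Face page.
pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",
    trust_remote_code=True,
)

audio, sr = librosa.load("my_question.wav", sr=16000)  # 16 kHz mono audio
turns = [
    {"role": "system", "content": "You are a friendly and helpful voice assistant."},
]
result = pipe(
    {"audio": audio, "turns": turns, "sampling_rate": sr},
    max_new_tokens=64,
)
print(result)
```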
Read on if you're interested in training your own version of Ultravox.
Install the basic tools:
Homebrew is a package manager for macOS that also mostly works for Linux. If you're running Debian or Ubuntu Linux, you can alternatively get by with apt. Just simplifies our shell workflows; it frequently functions as our interface to all the other tools.
```bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew update
brew install just
```
It's recommended to use pyenv for managing environments due to the use of Poetry:
```bash
brew install xz
brew install pyenv
pyenv init
pyenv install 3.11
pyenv global 3.11
# Optional
pyenv shell 3.11
```
Note: Use of conda is NOT recommended with Poetry
After creating a virtual environment, install the required packages using `just` and `poetry`:

```bash
just install
```
We're using Poetry to manage the Python virtual environment. You can inspect your environment with `poetry env info`.
If you want to use MosaicML for training, you need to set up a few things to run on the MosaicML platform.
- Install and log in to the MosaicML CLI:

  ```bash
  pip install --upgrade mosaicml-cli
  mcli init
  mcli set api-key <new-value>
  ```

- Set API keys for the tools we use:

  ```bash
  # Hugging Face token for accessing walled data and models
  mcli create secret env HF_TOKEN=hf_<your_token>
  mcli create secret env HF_WRITE_TOKEN=hf_<your_token_with_write_access>
  # WandB token for logging experiments
  mcli create secret env WANDB_PROJECT=ultravox
  mcli create secret env WANDB_API_KEY=<your_wandb_key>
  ```
Currently, we keep both the LLM and the audio encoder frozen and only train the adapter/projector. Training Ultravox v0.4 took 2-3 hours on 8xH100 GPUs for 14K training steps.
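As a rough illustration of that setup (hypothetical parameter naming; the actual trainable modules are selected by the training config, not a helper like this):

```python
import torch.nn as nn

def freeze_all_but_projector(model: nn.Module) -> None:
    # Assumed convention: adapter/projector parameters contain "projector" in their name.
    for name, param in model.named_parameters():
        param.requires_grad = "projector" in name

# Toy demonstration with stand-in modules for the three components.
toy = nn.ModuleDict({
    "llm": nn.Linear(8, 8),
    "audio_encoder": nn.Linear(8, 8),
    "projector": nn.Linear(8, 8),
})
freeze_all_but_projector(toy)
print([n for n, p in toy.named_parameters() if p.requires_grad])  # only projector.* stay trainable
```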
Why would you want to (re-)train Ultravox? Here are a few scenarios:

1. You want to use a different LLM or audio encoder backbone.

   a. In this case you need to re-train the adapter. You can use `release_config.yaml`, which contains our config for our latest release, and you should be able to simply change the base LLM or encoder by specifying `--text-model <hf-model-id-for-llm>` and/or `--audio-model <hf-model-id-for-encoder>`.

2. You want to improve the knowledge of the model.

   a. We suggest either using RAG on the fly (no training needed) or fine-tuning the LLM backbone instead. Fine-tuning the LLM backbone does not require re-training Ultravox (i.e., the existing adapter will work).

3. You want to use your own audio data, for example to add support for a new language.

   a. First, prepare your dataset: at a bare minimum, the samples should have an `audio` and a text `continuation` field (see the sketch after this list).

   b. Take a look at `ds_tool.py` and `continuation.jinja`, as well as our variant of Common Voice that was created using `ds_tool` to add the `continuation` field.

   c. Add your dataset to the dataset mix in `release_config.yaml` and train.
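Here is a rough sketch of the minimum dataset shape from step 3a. The file paths, continuation text, and destination repo are placeholders; in practice `ds_tool.py` is the project's tooling for producing the `continuation` field.

```python
from datasets import Audio, Dataset

# Placeholder rows; in practice these would come from your own audio clips and transcripts.
samples = [
    {"audio": "clips/sample_0001.wav", "continuation": "and then she opened the door."},
    {"audio": "clips/sample_0002.wav", "continuation": "which is why the bridge stayed closed."},
]

ds = Dataset.from_list(samples)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # decode audio at 16 kHz on access
ds.push_to_hub("your-org/your-audio-dataset")  # assumed destination; requires a write token
```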
There's no one-size-fits-all. If you need help, you can find us on our Discord server here.
We do most of our training on the MosaicML platform, and therefore most of our tooling and docs are Mosaic-related. However, you can do the same training on your own GPU without much difficulty. Here we assume you have the environment set up (run `just install`). You can also take a look at `setup.sh`.
To kick off a training run you can do:
```bash
poetry run python -m ultravox.training.train --config_path ultravox/training/configs/release_config.yaml
```
For DDP training, make sure to add `torchrun`. We also recommend prefetching weights in advance:

```bash
TRAIN_ARGS="--config_path ultravox/training/configs/release_config.yaml"
poetry run python -m ultravox.training.helpers.prefetch_weights $TRAIN_ARGS
poetry run torchrun --nproc_per_node=8 -m ultravox.training.train $TRAIN_ARGS
```
For a debug run, you can use smaller models, datasets, or batch size. Here's a config that uses TinyLlama as the LLM backbone:
```bash
poetry run python -m ultravox.training.train --config_path ultravox/training/configs/asr_tinyllama_100s.yaml --batch_size 1 --report_logs_to tensorboard
```
We use SimpleParsing for configs. Configs are composable (i.e., you can specify zero or many configs), and `meta_config.yaml` is always used as the default. See `configs_base.py` to find the parameters you can modify, such as `--text-model`, `--device`, `--exp-name`, etc.
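For orientation, SimpleParsing turns dataclass fields into CLI flags roughly as in the standalone sketch below; these are not the actual ultravox config classes, so see `configs_base.py` for the real fields.

```python
from dataclasses import dataclass
from simple_parsing import ArgumentParser

@dataclass
class TrainConfig:
    # Illustrative fields only; the real ones live in configs_base.py.
    text_model: str = "meta-llama/Meta-Llama-3-8B-Instruct"
    device: str = "cuda"
    exp_name: str = "debug-run"

parser = ArgumentParser()
parser.add_arguments(TrainConfig, dest="config")
args = parser.parse_args()
print(args.config)
```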
Before running any training jobs, set up SSH authentication with MosaicML:
1. Generate an SSH key:

   ```bash
   ssh-keygen -f ~/.ssh/mclid_id_rsa
   ```

2. Add the public key to your GitHub account.

3. Upload the private key to MosaicML (this allows MosaicML to clone the repository and run jobs):

   ```bash
   mcli create secret git-ssh ~/.ssh/mclid_id_rsa
   ```
Then you can run the following command to kick off a training job:
```bash
mcli run -f mcloud_train.yaml --follow
```
Other useful commands:
```bash
mcli get clusters
mcli util r7z2
mcli get runs
mcli get runs --cluster r7z2
mcli run -f mcloud_eval.yaml --follow
```
For interactive runs you can use:
```bash
just mcloud --image mosaicml/composer:latest --max-duration 1
```
IMPORTANT: Make sure to monitor your jobs and stop the machine when you're done with any job, especially interactive ones!
For inference or evaluations, you can use:
```bash
just eval --config_path ultravox/evaluation/configs/eval_config.yaml
```
where `eval_config.yaml` is a config file that specifies the model, datasets, and configurations to use for inference or evaluation. If your dataset is not already defined in ultravox, you need to create a config file for your dataset in `ultravox/data/configs/` (with the appropriate `eval_config` field to specify evaluation metrics and arguments) and register it in `ultravox/data/registry.py`. Please refer to the examples in `ultravox/data/configs/`.
The Justfile is a good resource for finding popular commands. Here are a few:

```bash
just update   # update dependencies
just format   # run formatting (black, isort, autoflake)
just test     # run tests
just python   # activate venv and run python
```