Reliable model swapping for any local OpenAI/Anthropic compatible server - llama.cpp, vllm, etc
Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.
Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
- ✅ Easy to deploy and configure: one binary, one configuration file, no external dependencies
- ✅ On-demand model switching
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.)
  - Future-proof: upgrade your inference servers at any time
- ✅ OpenAI API supported endpoints
- ✅ Anthropic API supported endpoints: v1/messages (see the example after this list)
- ✅ llama-server (llama.cpp) supported endpoints:
  - v1/rerank, v1/reranking, /rerank
  - /infill - for code infilling
  - /completion - for the completion endpoint
- ✅ llama-swap API
- ✅ Customizable
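As a quick illustration of the Anthropic-compatible endpoint, a request shaped like the standard Anthropic Messages API can be pointed at llama-swap. This is a hedged sketch: the listen address, the model ID "my-model", and the headers are placeholders; whether an API key header is required depends on your setup.

```shell
# hedged sketch: address, model ID, and headers are illustrative
curl http://localhost:8080/v1/messages \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "my-model",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```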
llama-swap includes a real-time web interface for monitoring logs and controlling models:

The Activity Page shows recent requests:

llama-swap can be installed in multiple ways:
- Docker
- Homebrew (OSX and Linux)
- WinGet
- From release binaries
- From source
Docker Install (download images)
Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.), including non-root variants with improved security.
```shell
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
    -v /path/to/models:/models \
    -v /path/to/custom/config.yaml:/app/config.yaml \
    ghcr.io/mostlygeek/llama-swap:cuda

# configuration hot reload supported with a
# directory volume mount
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
    -v /path/to/models:/models \
    -v /path/to/custom/config.yaml:/app/config.yaml \
    -v /path/to/config:/config \
    ghcr.io/mostlygeek/llama-swap:cuda -config /config/config.yaml -watch-config
```
more examples
```shell
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa

# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795

# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
```
Homebrew Install (OSX and Linux)
```shell
brew tap mostlygeek/llama-swap
brew install llama-swap

# run llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
```
WinGet Install
Note
WinGet is maintained by community contributor Dvd-Znf (#327). It is not an official part of llama-swap.
```shell
# install
C:\> winget install llama-swap

# upgrade
C:\> winget upgrade llama-swap
```
Release Binaries
Binaries are available on the release page for Linux, Mac, Windows and FreeBSD.
Build from Source
- Building requires Go and Node.js (for UI).
```shell
git clone https://github.com/mostlygeek/llama-swap.git
make clean all
```
- Look in the build/ subdirectory for the llama-swap binary.
Configuration
```yaml
# minimum viable config.yaml
models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf
```
That's all you need to get started:
- models - holds all model configurations
- model1 - the ID used in API calls
- cmd - the command to run to start the server
- ${PORT} - an automatically assigned port number
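With that config in place, a request can be sent to llama-swap using the standard OpenAI chat completions format. A hedged sketch, assuming llama-swap is listening on localhost:8080 as in the Homebrew example above:

```shell
# "model1" matches the ID defined under models: in config.yaml
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model1",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```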
Almost all configuration settings are optional and can be added one step at a time:
- Advanced features
  - groups to run multiple models at once
  - hooks to run things on startup
  - macros for reusable snippets
- Model customization
  - ttl to automatically unload models
  - aliases to use familiar model names (e.g., "gpt-4o-mini")
  - env to pass custom environment variables to inference servers
  - cmdStop to gracefully stop Docker/Podman containers
  - useModelName to override model names sent to upstream servers
  - ${PORT} automatic port variables for dynamic port assignment
  - filters to rewrite parts of requests before sending to the upstream server
See the configuration documentation for all options.
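As a rough illustration of how several of these options fit together, here is a hedged sketch of a fuller config.yaml. The option names (macros, ttl, aliases, env, groups) come from the feature list above, but the exact nesting, defaults, and group behavior flags should be checked against the configuration documentation; paths and model names are placeholders.

```yaml
# hedged sketch - verify field placement and defaults against the docs
macros:
  "server-cmd": "llama-server --port ${PORT}"   # reusable snippet

models:
  "llama-8b":
    cmd: ${server-cmd} --model /path/to/llama-8b.gguf
    ttl: 300                       # unload after 300 seconds of inactivity
    aliases:
      - "gpt-4o-mini"              # requests for this name route here
    env:
      - "CUDA_VISIBLE_DEVICES=0"   # env vars passed to the inference server

  "qwen-coder":
    cmd: ${server-cmd} --model /path/to/qwen-coder.gguf

groups:
  # models in a group can be loaded at the same time; see the groups docs
  # for the behavior flags that control swapping within a group
  coding:
    members:
      - "llama-8b"
      - "qwen-coder"
```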
When a request is made to an OpenAI compatible endpoint, llama-swap extracts the model value and loads the appropriate server configuration to serve it. If the wrong upstream server is running, it is replaced with the correct one. This is where the "swap" comes in: the upstream server is automatically swapped so the request is handled correctly.
In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, thegroups feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses, which breaks Server-Sent Events (SSE) and streaming chat completions. (#236)
Recommended nginx configuration snippets:
```nginx
# SSE for UI events/logs
location /api/events {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}

# Streaming chat completions (stream=true)
location /v1/chat/completions {
    proxy_pass http://your-llama-swap-backend;
    proxy_buffering off;
    proxy_cache off;
}
```
As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.
```shell
# sends up to the last 10KB of logs
curl http://host/logs

# streams combined logs
curl -Ns 'http://host/logs/stream'

# just llama-swap's logs
curl -Ns 'http://host/logs/stream/proxy'

# just upstream's logs
curl -Ns 'http://host/logs/stream/upstream'

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'
```
Any OpenAI compatible server will work. llama-swap was originally designed for llama-server, and it is the best supported.
For Python-based inference servers like vllm or tabbyAPI, it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to SIGTERM signals for proper shutdown.
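For example, a containerized model entry might pair cmd with cmdStop roughly like the sketch below. This is a hedged illustration, not a tested recipe: the image, flags, port mapping, and the proxy line are assumptions used to show the shape of such an entry.

```yaml
# hedged sketch for running vllm in a container
models:
  "vllm-qwen":
    cmd: >
      docker run --rm --name vllm-qwen
      --gpus all -p ${PORT}:8000
      vllm/vllm-openai:latest
      --model Qwen/Qwen2.5-7B-Instruct
    # stop the container by name so it shuts down cleanly when swapped out
    cmdStop: docker stop vllm-qwen
    # upstream address (assumption - defaults may already cover this)
    proxy: http://127.0.0.1:${PORT}
```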
Note
⭐️ Star this project to help others discover it!