Olla - Smart LLM Load Balancer & Proxy

License · Go · CI · Go Report Card · Latest Release

Native support: Ollama, LM Studio, llama.cpp, vLLM, vLLM-MLX, SGLang, LiteLLM, LemonadeSDK, Docker Model Runner
OpenAI compatible: LM Deploy


Recorded with VHS - see demo tape

Documentation · Issues · Releases

Important

Olla is currently in active development. While it is usable, we are still finalising some features and optimisations. Your feedback is invaluable! Open an issue and let us know which features you'd like to see in the future.

Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, with a wide variety of natively supported endpoints and enough extensibility to support others. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.

Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability, or Olla for maximum performance with advanced features like circuit breakers and connection pooling.

Olla Single OpenAI

A single CLI application and config file is all you need to get going with Olla!

Key Features

Platform Support

Olla runs on multiple platforms and architectures:

| Platform | AMD64 | ARM64 | Notes |
|----------|-------|-------|-------|
| Linux    | ✓     | ✓     | Full support including Raspberry Pi 4+ |
| macOS    | ✓     | ✓     | Intel and Apple Silicon (M1/M2/M3/M4) |
| Windows  | ✓     | ✓     | Windows 10/11 and Windows on ARM |
| Docker   | ✓     | ✓     | Multi-architecture images (amd64/arm64) |

Quick Start

Installation

# Download latest release (auto-detects your platform)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)

# Docker (automatically pulls correct architecture)
docker run -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Or explicitly specify platform (e.g., for ARM64)
docker run --platform linux/arm64 -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Install via Go
go install github.com/thushan/olla@latest

# Build from source
git clone https://github.com/thushan/olla.git && cd olla && make build-release

# Run Olla
./bin/olla

Verification

When you have everything running, you can check it's all working with:

# Check health of Olla
curl http://localhost:40114/internal/health

# Check endpoints
curl http://localhost:40114/internal/status/endpoints

# Check models available
curl http://localhost:40114/internal/status/models
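The same internal endpoints can also be polled from a script. Below is a minimal Python sketch (standard library only) that assumes Olla is running on the default port 40114; it simply prints each endpoint's HTTP status and the start of the response body, since the exact response schema is covered in the documentation rather than here.

```python
import urllib.request

# Internal endpoints exposed by Olla (default port 40114 assumed)
BASE = "http://localhost:40114"
CHECKS = ["/internal/health", "/internal/status/endpoints", "/internal/status/models"]

for path in CHECKS:
    try:
        with urllib.request.urlopen(BASE + path, timeout=5) as resp:
            body = resp.read().decode("utf-8")
            # Print the status code and the start of the body; see the
            # Olla documentation for the full response schema.
            print(f"{path}: HTTP {resp.status}")
            print(body[:300])
    except OSError as exc:
        print(f"{path}: FAILED ({exc})")
```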

For detailed installation and deployment options, see the Getting Started Guide.

Querying Olla

Olla exposes multiple API paths depending on your use case:

| Path | Format | Use Case |
|------|--------|----------|
| /olla/proxy/ | OpenAI | Routes to any backend (universal endpoint) |
| /olla/anthropic/ | Anthropic | Claude-compatible clients (passthrough or translated) |
| /olla/{provider}/ | OpenAI | Target a specific backend type (e.g. /olla/vllm/, /olla/ollama/) |

OpenAI-Compatible (Universal Proxy)

# Chat completion (routes to best available backend)
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'

# Streaming
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100, "stream": true}'

# List all models across backends
curl http://localhost:40114/olla/proxy/v1/models
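Because /olla/proxy/ exposes an OpenAI-compatible API, standard OpenAI client libraries can be pointed at it by overriding the base URL. Below is a minimal sketch using the official openai Python package (assumed to be installed); the model name and port follow the curl examples above, and the API key is a placeholder since Olla itself doesn't require one.

```python
from openai import OpenAI

# Point the standard OpenAI client at Olla's universal proxy path.
client = OpenAI(
    base_url="http://localhost:40114/olla/proxy/v1",
    api_key="not-needed",  # placeholder; Olla doesn't require a key here
)

response = client.chat.completions.create(
    model="llama3.2",  # any model available on your backends
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(response.choices[0].message.content)

# Streaming works the same way as against the OpenAI API.
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```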

Anthropic Messages API

# Chat completion (passthrough for supported backends, translated for others)
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello!"}]}'

# Streaming
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello!"}], "stream": true}'
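Likewise, Anthropic SDK clients can target /olla/anthropic by overriding the base URL. A minimal sketch with the anthropic Python package (assumed to be installed) is shown below; the SDK supplies the anthropic-version header itself, and the key is again a placeholder, matching the curl examples above.

```python
from anthropic import Anthropic

# Point the Anthropic SDK at Olla's Messages API path; the SDK appends
# /v1/messages to the base URL, matching the curl examples above.
client = Anthropic(
    base_url="http://localhost:40114/olla/anthropic",
    api_key="not-needed",  # placeholder; Olla doesn't require a key here
)

message = client.messages.create(
    model="llama3.2",
    max_tokens=100,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
```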

Provider-Specific Endpoints

# Target a specific backend type directly
curl http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'

# Other providers: /olla/vllm/, /olla/vllm-mlx/, /olla/lm-studio/, /olla/llamacpp/, etc.

Examples

We've also got ready-to-use Docker Compose setups for common scenarios:

Common Architectures

  • Home Lab: Olla → multiple Ollama (or OpenAI-compatible, e.g. vLLM) instances across your machines
  • Hybrid Cloud: Olla → Local endpoints + LiteLLM → Cloud APIs (OpenAI, Anthropic, Bedrock, etc.)
  • Enterprise: Olla → GPUStack cluster + vLLM servers + LiteLLM (cloud overflow)
  • Development: Olla → Local + Shared team endpoints + LiteLLM (API access)

See integration patterns for detailed architectures.

🌐OpenWebUI Integration

Complete setup with OpenWebUI + Olla, load balancing multiple Ollama instances or unifying all OpenAI-compatible models.

  • See: examples/ollama-openwebui/
  • Services: OpenWebUI (web UI) + Olla (proxy/load balancer)
  • Use Case: Web interface with intelligent load balancing across multiple Ollama servers
  • Quick Start:

    cd examples/ollama-openwebui
    # Edit olla.yaml to configure your Ollama endpoints
    docker compose up -d
    # Access OpenWebUI at http://localhost:3000

You can learn more about OpenWebUI Ollama with Olla or see OpenWebUI OpenAI with Olla.

🤖 Anthropic Messages API / CLI Tools - Claude Code, OpenCode, Crush

Olla's Anthropic Messages API translation (v0.0.20+) is enabled by default, allowing you to use CLI tools like Claude Code with local AI models on your machine via /olla/anthropic. This feature is still being actively improved, so please report any issues or feedback.

We have examples for Claude Code, OpenCode and Crush.

Learn more about Anthropic API Translation.

Documentation

Full documentation is available at https://thushan.github.io/olla/

🤝 Contributing

We welcome contributions! Please open an issue first to discuss major changes.

🤖 AI Disclosure

This project has been built with the assistance of AI tools for documentation, test refinement, and code reviews.

We've utilised GitHub Copilot, Anthropic Claude, Jetbrains Junie and OpenAI ChatGPT for documentation, code reviews, test refinement and troubleshooting.

🙏 Acknowledgements

📄 License

Licensed under the Apache License 2.0. See LICENSE for details.

🎯 Roadmap

  • Circuit breakers: Advanced fault tolerance (Olla engine)
  • Connection pooling: Per-endpoint connection management (Olla engine)
  • Object pooling: Reduced GC pressure for high throughput (Olla engine)
  • Model routing: Route based on model requested
  • Authenticated Endpoints: Support calling authenticated (bearer token) endpoints such as OpenAI, Groq and OpenRouter
  • Auto endpoint discovery: Add endpoints, let Olla determine the type
  • Model benchmarking: Benchmark models across multiple endpoints easily
  • Metrics export: Prometheus/OpenTelemetry integration
  • Dynamic configuration: API-driven endpoint management
  • TLS termination: Built-in SSL support
  • Olla Admin Panel: View Olla metrics easily within the browser
  • Model caching: Intelligent model preloading
  • Advanced Connection Management: Authenticated endpoints (via SSH tunnels, OAuth, Tokens)
  • OpenRouter Support: Support OpenRouter calls within Olla (e.g. divert to free models on OpenRouter)

Let us know what you want to see!


Made with ❤️ for the LLM community

🏠 Homepage · 📖 Documentation · 🐛 Issues · 🚀 Releases
