Quick Start¶
Get Olla up and running with this quick start guide.
Prerequisites¶
- Olla installed on your system
- At least one compatible LLM endpoint running (for example Ollama; see the example below)
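If you don't have an endpoint yet, the quickest option is a local Ollama instance (assuming Ollama is installed); it listens on http://localhost:11434 by default, which matches the examples in this guide:

```bash
ollama serve          # start the local Ollama API on http://localhost:11434
ollama pull llama3.2  # fetch the model used in the example requests below
```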
Configuration Examples
Olla merges your YAML file on top of built-in defaults, so you only need to specify what you want to override. The shipped config/config.yaml shows all available options for reference.
Basic Setup¶
1. Create Configuration¶
Create a config.yaml for your setup.
Configuration Best Practice
Create a config/config.local.yaml containing only the settings you need to change. Built-in defaults cover everything else. This file takes priority over config.yaml and won't be committed to version control.
```bash
$ cp config/config.yaml config/config.local.yaml
$ vi config/config.local.yaml   # keep only the settings you need to override
```

See the configuration overview for merge behaviour details.
Here's a minimal configuration example, showing the most common changes users make:
```yaml
server:
  host: "0.0.0.0"
  port: 40114
  request_logging: true

proxy:
  engine: "olla"            # or "sherpa" for small instances
  load_balancer: "priority"

discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

logging:
  level: "info"
  format: "json"
```

Settings like check_interval, check_timeout, and priority are optional -- Olla provides sensible defaults for each backend type via its profile system.
Everything else comes from the shipped defaults.
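For example, if you only run a single local Ollama instance, a working config/config.local.yaml can be as small as the sketch below; every setting you omit falls back to the built-in defaults:

```yaml
# Minimal override sketch -- everything not listed here comes from the shipped defaults
discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
```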
2. Start Olla¶
Start Olla with your configuration:
```bash
# Uses config/config.local.yaml automatically (if present)
olla

# Or specify a custom config
olla --config my-awesome-config.yaml
```

On startup, you'll see which configuration was loaded:
{"level":"INFO","msg":"Initialising","version":"v0.x.x","pid":123456}{"level":"INFO","msg":"System Configuration","isContainerised":false,...}{"level":"INFO","msg":"Loaded configuration","config":"config/config.local.yaml"}{"level":"INFO","msg":"Initialising stats collector"}...3. Test the Proxy¶
Check that Olla is running:
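The simplest check is the health endpoint (also used in the Monitoring section below):

```bash
curl http://localhost:40114/internal/health
```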
List available models through the proxy:
```bash
# For Ollama endpoints
curl http://localhost:40114/olla/ollama/api/tags

# For OpenAI-compatible endpoints
curl http://localhost:40114/olla/ollama/v1/models
```

Example Requests¶
Chat Completion (OpenAI-compatible)¶
```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```

Ollama Generate¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?"
  }'
```

Streaming Response¶
```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

llama.cpp Endpoint¶
```bash
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Multiple Endpoints Configuration¶
Configure multiple LLM endpoints with load balancing:
```yaml
discovery:
  type: "static"
  static:
    endpoints:
      # High priority local Ollama
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

      # Medium priority LM Studio
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 50

      # llama.cpp endpoint
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95

      # Low priority remote endpoint
      - url: "https://api.example.com"
        name: "remote-api"
        type: "openai"
        priority: 10
```

Monitoring¶
Monitor Olla's performance:
```bash
# Health status
curl http://localhost:40114/internal/health

# System status and statistics
curl http://localhost:40114/internal/status
```

Response headers provide request tracing:
Look for these headers:
- X-Olla-Endpoint: Which backend handled the request
- X-Olla-Backend-Type: Type of backend (ollama/openai/lm-studio)
- X-Olla-Request-ID: Unique request identifier
- X-Olla-Response-Time: Total processing time
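To see them, repeat any earlier request with curl -i so response headers are printed. The header values below are illustrative only, based on the single-endpoint configuration above:

```bash
curl -i http://localhost:40114/olla/ollama/api/tags
# Illustrative output:
# HTTP/1.1 200 OK
# X-Olla-Endpoint: local-ollama
# X-Olla-Backend-Type: ollama
# X-Olla-Request-ID: <unique request id>
# X-Olla-Response-Time: 12ms
```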
Common Configuration Options¶
High-Performance Setup¶
For production environments, use the Olla engine:
```yaml
proxy:
  engine: "olla"                      # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: 30s
  # Note: Automatic retry on connection failures is built-in
```

Rate Limiting¶
Protect your endpoints with rate limiting:
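The shipped config/config.yaml lists the authoritative rate-limiting options. The snippet below is only an illustrative sketch, and the key names (rate_limits, global_requests_per_minute, per_ip_requests_per_minute) are assumptions used to show the shape of a server-level limit:

```yaml
# Sketch only -- confirm key names against the shipped config/config.yaml
server:
  rate_limits:
    global_requests_per_minute: 1000   # assumed key: total requests across all clients
    per_ip_requests_per_minute: 100    # assumed key: per-client ceiling
```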
Request Size Limits¶
Set appropriate request limits:
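Again as a hedged sketch (the key names request_limits, max_body_size, and max_header_size are assumptions; check the shipped defaults for the exact names), request size limits sit alongside the other server settings:

```yaml
# Sketch only -- confirm key names against the shipped config/config.yaml
server:
  request_limits:
    max_body_size: 50MB     # assumed key: reject oversized prompt payloads
    max_header_size: 512KB  # assumed key: cap request header size
```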
Learn More¶
Core Concepts¶
- Proxy Engines - Compare Sherpa vs Olla engines
- Load Balancing - Priority, round-robin, and least-connections strategies
- Model Unification - How models are aggregated across endpoints
- Health Checking - Automatic endpoint monitoring
- Profile System - Customise backend behaviour
Configuration¶
- Configuration Overview - Complete configuration guide
- Proxy Profiles - Auto, streaming, and standard profiles
- Best Practices - Production recommendations
Next Steps¶
- Backend Integrations - Connect Ollama, LM Studio, llama.cpp, vLLM, SGLang, Lemonade SDK, LiteLLM
- Architecture Overview - Deep dive into Olla's design
- Development Guide - Contribute to Olla
Troubleshooting¶
Endpoint Not Responding¶
Check your endpoint URLs and ensure the services are running:
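A quick way to confirm this is to curl each backend directly, bypassing Olla (ports taken from the multi-endpoint example above):

```bash
curl http://localhost:11434/api/version   # Ollama
curl http://localhost:1234/v1/models      # LM Studio
curl http://localhost:8080/v1/models      # llama.cpp (OpenAI-compatible server)
```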
Health Checks Failing¶
Verify health check URLs are correct for your endpoint type:
- Ollama: Use / or /api/version
- LM Studio: Use / or /v1/models
- OpenAI-compatible: Use /v1/models
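If a backend needs a non-standard path, you can point the health check at it explicitly. Note that health_check_url below is an assumed key name for illustration; confirm the exact option in the configuration overview:

```yaml
# Sketch only -- health_check_url is an assumed key name
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        health_check_url: "/api/version"
```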
High Latency¶
Consider switching to the high-performance Olla engine:
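This is the same proxy block shown in the High-Performance Setup above:

```yaml
proxy:
  engine: "olla"
  load_balancer: "least-connections"
```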
For more detailed troubleshooting, check the logs and open an issue if needed.