Quick Start¶
Get Olla up and running with this quick start guide.
Prerequisites¶
- Olla installed on your system
- At least one compatible LLM endpoint running (for example Ollama; see the example below)
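If you don't have an endpoint yet, the quickest option is a local Ollama instance (assuming Ollama is installed); it listens on http://localhost:11434 by default, which matches the examples in this guide:

```bash
ollama serve          # start the local Ollama API on http://localhost:11434
ollama pull llama3.2  # fetch the model used in the example requests below
```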
Configuration Examples
Olla merges your YAML file on top of built-in defaults, so you only need to specify what you want to override. The shipped config/config.yaml shows all available options for reference.
Basic Setup¶
1. Create Configuration¶
Create a config.yaml for your setup.
Configuration Best Practice
Create a config/config.local.yaml containing only the settings you need to change. Built-in defaults cover everything else. This file takes priority over config.yaml and won't be committed to version control.
```bash
$ cp config/config.yaml config/config.local.yaml
$ vi config/config.local.yaml   # keep only the settings you need to override
```

See the configuration overview for merge behaviour details.
Here's a minimal configuration example, showing the most common changes users make:
```yaml
server:
  host: "0.0.0.0"
  port: 40114
  request_logging: true

proxy:
  engine: "olla"            # or "sherpa" for small instances
  load_balancer: "priority"

discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

logging:
  level: "info"
  format: "json"
```

Settings like check_interval, check_timeout, and priority are optional -- Olla provides sensible defaults for each backend type via its profile system.
Everything else comes from the shipped defaults.
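For example, if you only run a single local Ollama instance, a working config/config.local.yaml can be as small as the sketch below; every setting you omit falls back to the built-in defaults:

```yaml
# Minimal override sketch -- everything not listed here comes from the shipped defaults
discovery:
  type: "static"
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
```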
2. Start Olla¶
Start Olla with your configuration:
```bash
# Uses config/config.local.yaml automatically (if present)
olla

# Or specify a custom config
olla --config my-awesome-config.yaml
```

On startup, you'll see which configuration was loaded:
{"level":"INFO","msg":"Initialising","version":"v0.x.x","pid":123456}{"level":"INFO","msg":"System Configuration","isContainerised":false,...}{"level":"INFO","msg":"Loaded configuration","config":"config/config.local.yaml"}{"level":"INFO","msg":"Initialising stats collector"}...3. Test the Proxy¶
Check that Olla is running:
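The simplest check is the health endpoint (also used in the Monitoring section below):

```bash
curl http://localhost:40114/internal/health
```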
List available models through the proxy:
```bash
# For Ollama endpoints
curl http://localhost:40114/olla/ollama/api/tags

# For OpenAI-compatible endpoints
curl http://localhost:40114/olla/ollama/v1/models
```

Example Requests¶
Chat Completion (OpenAI-compatible)¶
```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Hello, how are you?"}
    ]
  }'
```

Ollama Generate¶
```bash
curl -X POST http://localhost:40114/olla/ollama/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "prompt": "Why is the sky blue?"
  }'
```

Streaming Response¶
```bash
curl -X POST http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ],
    "stream": true
  }'
```

llama.cpp Endpoint¶
```bash
curl -X POST http://localhost:40114/olla/llamacpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-q4_k_m.gguf",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

Multiple Endpoints Configuration¶
Configure multiple LLM endpoints with load balancing:
```yaml
discovery:
  type: "static"
  static:
    endpoints:
      # High priority local Ollama
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        priority: 100

      # Medium priority LM Studio
      - url: "http://localhost:1234"
        name: "local-lm-studio"
        type: "lm-studio"
        priority: 50

      # llama.cpp endpoint
      - url: "http://localhost:8080"
        name: "local-llamacpp"
        type: "llamacpp"
        priority: 95

      # Low priority remote endpoint
      - url: "https://api.example.com"
        name: "remote-api"
        type: "openai"
        priority: 10
```

Monitoring¶
Monitor Olla's performance:
```bash
# Health status
curl http://localhost:40114/internal/health

# System status and statistics
curl http://localhost:40114/internal/status
```

Response headers provide request tracing:
Look for these headers:
- X-Olla-Endpoint: Which backend handled the request
- X-Olla-Backend-Type: Type of backend (ollama/openai/lm-studio)
- X-Olla-Request-ID: Unique request identifier
- X-Olla-Response-Time: Total processing time
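To see them, repeat any earlier request with curl -i so response headers are printed. The header values below are illustrative only, based on the single-endpoint configuration above:

```bash
curl -i http://localhost:40114/olla/ollama/api/tags
# Illustrative output:
# HTTP/1.1 200 OK
# X-Olla-Endpoint: local-ollama
# X-Olla-Backend-Type: ollama
# X-Olla-Request-ID: <unique request id>
# X-Olla-Response-Time: 12ms
```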
Common Configuration Options¶
High-Performance Setup¶
For production environments, use the Olla engine:
```yaml
proxy:
  engine: "olla"                      # High-performance engine
  load_balancer: "least-connections"
  connection_timeout: 30s
  # Note: Automatic retry on connection failures is built-in
```

Rate Limiting¶
Protect your endpoints with rate limiting:
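The shipped config/config.yaml lists the authoritative rate-limiting options. The snippet below is only an illustrative sketch, and the key names (rate_limits, global_requests_per_minute, per_ip_requests_per_minute) are assumptions used to show the shape of a server-level limit:

```yaml
# Sketch only -- confirm key names against the shipped config/config.yaml
server:
  rate_limits:
    global_requests_per_minute: 1000   # assumed key: total requests across all clients
    per_ip_requests_per_minute: 100    # assumed key: per-client ceiling
```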
Request Size Limits¶
Set appropriate request limits:
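Again as a hedged sketch (the key names request_limits, max_body_size, and max_header_size are assumptions; check the shipped defaults for the exact names), request size limits sit alongside the other server settings:

```yaml
# Sketch only -- confirm key names against the shipped config/config.yaml
server:
  request_limits:
    max_body_size: 50MB     # assumed key: reject oversized prompt payloads
    max_header_size: 512KB  # assumed key: cap request header size
```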
Learn More¶
Core Concepts¶
- Proxy Engines - Compare Sherpa vs Olla engines
- Load Balancing - Priority, round-robin, and least-connections strategies
- Model Unification - How models are aggregated across endpoints
- Health Checking - Automatic endpoint monitoring
- Profile System - Customise backend behaviour
Configuration¶
- Configuration Overview - Complete configuration guide
- Proxy Profiles - Auto, streaming, and standard profiles
- Best Practices - Production recommendations
Next Steps¶
- Backend Integrations - Connect Ollama, LM Studio, llama.cpp, vLLM, SGLang, Lemonade SDK, LiteLLM
- Architecture Overview - Deep dive into Olla's design
- Development Guide - Contribute to Olla
Troubleshooting¶
Endpoint Not Responding¶
Check your endpoint URLs and ensure the services are running:
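A quick way to confirm this is to curl each backend directly, bypassing Olla (ports taken from the multi-endpoint example above):

```bash
curl http://localhost:11434/api/version   # Ollama
curl http://localhost:1234/v1/models      # LM Studio
curl http://localhost:8080/v1/models      # llama.cpp (OpenAI-compatible server)
```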
Health Checks Failing¶
Verify health check URLs are correct for your endpoint type:
- Ollama: Use / or /api/version
- LM Studio: Use / or /v1/models
- OpenAI-compatible: Use /v1/models
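If a backend needs a non-standard path, you can point the health check at it explicitly. Note that health_check_url below is an assumed key name for illustration; confirm the exact option in the configuration overview:

```yaml
# Sketch only -- health_check_url is an assumed key name
discovery:
  static:
    endpoints:
      - url: "http://localhost:11434"
        name: "local-ollama"
        type: "ollama"
        health_check_url: "/api/version"
```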
High Latency¶
Consider switching to the high-performance Olla engine:
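This is the same proxy block shown in the High-Performance Setup above:

```yaml
proxy:
  engine: "olla"
  load_balancer: "least-connections"
```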
For more detailed troubleshooting, check the logs and open an issue if needed.