
Serving LLMs#

Ray Serve LLM APIs allow users to deploy multiple LLM models together with a familiar Ray Serve API, while providing compatibility with the OpenAI API.

Features#

  • ⚡️ Automatic scaling and load balancing

  • 🌐 Unified multi-node multi-model deployment

  • 🔌 OpenAI compatible

  • 🔄 Multi-LoRA support with shared base models

  • 🚀 Engine-agnostic architecture (e.g., vLLM, SGLang)

Requirements#

pip install ray[serve,llm]>=2.43.0 vllm>=0.7.2

# Suggested dependencies when using vllm 0.7.2:
pip install xgrammar==0.1.11 pynvml==12.0.0

Key Components#

The ray.serve.llm module provides two key deployment types for serving LLMs:

LLMServer#

The LLMServer sets up and manages the vLLM engine for model serving. It can be used standalone or combined with your own custom Ray Serve deployments.

LLMRouter#

This deployment provides an OpenAI-compatible FastAPI ingress and routes traffic to the appropriate model for multi-model services. The following endpoints are supported (a client example follows the list):

  • /v1/chat/completions: Chat interface (ChatGPT-style)

  • /v1/completions: Text completion

  • /v1/embeddings: Text embeddings

  • /v1/models: List available models

  • /v1/models/{model}: Model information
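For example, once an application is running (see the Quickstart below), you can exercise the model-listing endpoints with the OpenAI Python client. A minimal sketch, assuming a model registered under the ID qwen-0.5b and the default Serve address:

from openai import OpenAI

# Point the client at the Ray Serve ingress (default local address).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# /v1/models: list every model registered with the router.
for model in client.models.list():
    print(model.id)

# /v1/models/{model}: fetch metadata for a single model.
print(client.models.retrieve("qwen-0.5b"))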

Configuration#

LLMConfig#

The LLMConfig class specifies model details such as:

  • Model loading sources (HuggingFace or cloud storage)

  • Hardware requirements (accelerator type)

  • Engine arguments (e.g. vLLM engine kwargs)

  • LoRA multiplexing configuration

  • Serve auto-scaling parameters

Quickstart Examples#

Deployment through LLMRouter#

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
Alternatively, you can compose the same application from the LLMServer and LLMRouter deployments directly:

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    # You can customize the engine arguments (e.g. vLLM engine kwargs)
    engine_kwargs=dict(
        tensor_parallel_size=2,
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app, blocking=True)

You can query the deployed models using either cURL or the OpenAI Python client:

curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer fake-key" \
     -d '{
           "model": "qwen-0.5b",
           "messages": [{"role": "user", "content": "Hello!"}]
         }'
from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Basic chat completion with streaming
response = client.chat.completions.create(
    model="qwen-0.5b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

For deploying multiple models, you can pass a list of LLMConfig objects to the LLMRouter deployment:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config1 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

llm_config2 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-1.5b",
        model_source="Qwen/Qwen2.5-1.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

app = build_openai_app({"llm_configs": [llm_config1, llm_config2]})
serve.run(app, blocking=True)
Alternatively, using LLMServer and LLMRouter directly:

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

llm_config1 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

llm_config2 = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-1.5b",
        model_source="Qwen/Qwen2.5-1.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

# Deploy the application
deployment1 = LLMServer.as_deployment(
    llm_config1.get_serve_options(name_prefix="vLLM:")
).bind(llm_config1)
deployment2 = LLMServer.as_deployment(
    llm_config2.get_serve_options(name_prefix="vLLM:")
).bind(llm_config2)
llm_app = LLMRouter.as_deployment().bind([deployment1, deployment2])
serve.run(llm_app, blocking=True)
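Once both models are deployed behind the same router, a client selects which one handles a given request through the model field. The following sketch mirrors the earlier single-model client example and assumes the qwen-1.5b ID configured above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# The `model` field determines which deployment serves this request.
response = client.chat.completions.create(
    model="qwen-1.5b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)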

See also Serve DeepSeek for an example of deploying DeepSeek models.

Production Deployment#

For production deployments, Ray Serve LLM provides utilities for config-driven deployments. You can specify your deployment configuration using YAML files:

# config.yaml
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2.5-0.5B-Instruct
      accelerator_type: A10G
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
    - model_loading_config:
        model_id: qwen-1.5b
        model_source: Qwen/Qwen2.5-1.5B-Instruct
      accelerator_type: A10G
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
Or, equivalently, you can reference standalone per-model config files:

# config.yaml
applications:
- args:
    llm_configs:
      - models/qwen-0.5b.yaml
      - models/qwen-1.5b.yaml
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
# models/qwen-0.5b.yaml
model_loading_config:
  model_id: qwen-0.5b
  model_source: Qwen/Qwen2.5-0.5B-Instruct
accelerator_type: A10G
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2
# models/qwen-1.5b.yaml
model_loading_config:
  model_id: qwen-1.5b
  model_source: Qwen/Qwen2.5-1.5B-Instruct
accelerator_type: A10G
deployment_config:
  autoscaling_config:
    min_replicas: 1
    max_replicas: 2

To deploy using either configuration file:

serve run config.yaml

Generate config files#

Ray Serve LLM provides a CLI to generate config files for your deployment:

python -m ray.serve.llm.gen_config

Note: This command requires interactive inputs. You should execute it directly in the terminal.

This command lets you pick from a common set of OSS LLMs and helps you configure them. You can tune settings like GPU type, tensor parallelism, and autoscaling parameters.

Note that if you're configuring a model whose architecture is different from the provided list of models, you should closely review the generated model config file to provide the correct values.

This command generates two files: an LLM config file, saved in model_config/, and a Ray Serve config file, serve_TIMESTAMP.yaml, that you can reference and re-run in the future.

After reading and reviewing the generated model config, see the vLLM engine configuration docs for further customization.

Observability#

Ray enables LLM service-level logging by default and makes these statistics available using Grafana and Prometheus. For more details on configuring Grafana and Prometheus, see Collecting and monitoring metrics.

These higher-level metrics track request and token behavior across deployed models. For example: average total tokens per request, ratio of input tokens to generated tokens, and peak tokens per second.

For visualization, Ray ships with a Serve LLM-specific dashboard, which is automatically available in Grafana. Example below:

[Image: Serve LLM Grafana dashboard]

Engine Metrics#

All engine metrics, including vLLM's, are available through the Ray metrics export endpoint and are queryable using Prometheus. See vLLM metrics for a complete list. These are also visualized by the Serve LLM Grafana dashboard. Dashboard panels include: time per output token (TPOT), time to first token (TTFT), and GPU cache utilization.

Engine metric logging is off by default and must be manually enabled. In addition, you must enable the vLLM V1 engine to use engine metrics. To enable engine-level metric logging, set log_engine_metrics: True when configuring the LLM deployment. For example:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    log_engine_metrics=True,
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
Or with an equivalent Serve config file:

# config.yaml
applications:
- args:
    llm_configs:
    - model_loading_config:
        model_id: qwen-0.5b
        model_source: Qwen/Qwen2.5-0.5B-Instruct
      accelerator_type: A10G
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 2
      log_engine_metrics: true
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"

Advanced Usage Patterns#

For each usage pattern, we provide a server and client code snippet.

Multi-LoRA Deployment#

You can use LoRA (Low-Rank Adaptation) to efficiently fine-tune models by configuring the LoraConfig. We use Ray Serve's multiplexing feature to serve multiple LoRA checkpoints from the same base model. This allows the adapter weights to be loaded on each replica on the fly and cached via an LRU mechanism.

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Configure the model with LoRA
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    lora_config=dict(
        # Let's pretend this is where LoRA weights are stored on S3.
        # For example
        # s3://my_dynamic_lora_path/lora_model_1_ckpt
        # s3://my_dynamic_lora_path/lora_model_2_ckpt
        # are two of the LoRA checkpoints
        dynamic_lora_loading_path="s3://my_dynamic_lora_path",
        max_num_adapters_per_replica=16,
    ),
    engine_kwargs=dict(
        enable_lora=True,
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Make a request to the desired LoRA checkpoint
response = client.chat.completions.create(
    model="qwen-0.5b:lora_model_1_ckpt",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

Embeddings#

You can generate embeddings by selecting the embed task in the engine arguments. Models supporting this use case are listed at vLLM text embedding models.

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    engine_kwargs=dict(
        task="embed",
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Request embeddings for a batch of inputs
response = client.embeddings.create(
    model="qwen-0.5b",
    input=["A text to embed", "Another text to embed"],
)

for data in response.data:
    print(data.embedding)  # List of floats (the embedding vector)
curl -X POST http://localhost:8000/v1/embeddings \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer fake-key" \
     -d '{
           "model": "qwen-0.5b",
           "input": ["A text to embed", "Another text to embed"],
           "encoding_format": "float"
         }'

Structured Output#

For structured output, you can use JSON mode similar to OpenAI’s API:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",
        model_source="Qwen/Qwen2.5-0.5B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="A10G",
)

# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Request structured JSON output
response = client.chat.completions.create(
    model="qwen-0.5b",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON.",
        },
        {"role": "user", "content": "List three colors in JSON format"},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Example response:
# {
#   "colors": [
#     "red",
#     "blue",
#     "green"
#   ]
# }

You can also specify a schema for the response using Pydantic models:

from openai import OpenAI
from typing import List, Literal
from pydantic import BaseModel

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Define a pydantic model of a preset of allowed colors
class Color(BaseModel):
    colors: List[Literal["cyan", "magenta", "yellow"]]

# Request structured JSON output
response = client.chat.completions.create(
    model="qwen-0.5b",
    response_format={
        "type": "json_schema",
        "json_schema": Color.model_json_schema(),
    },
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON.",
        },
        {"role": "user", "content": "List three colors in JSON format"},
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Example response:
# {
#   "colors": [
#     "cyan",
#     "magenta",
#     "yellow"
#   ]
# }
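Because the schema comes from a Pydantic model, you can also validate the returned JSON against the same class. A sketch using a non-streaming request with the Color model and the request shape shown above:

from openai import OpenAI
from typing import List, Literal
from pydantic import BaseModel

class Color(BaseModel):
    colors: List[Literal["cyan", "magenta", "yellow"]]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Non-streaming request: the whole JSON body arrives in one message.
response = client.chat.completions.create(
    model="qwen-0.5b",
    response_format={
        "type": "json_schema",
        "json_schema": Color.model_json_schema(),
    },
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs JSON.",
        },
        {"role": "user", "content": "List three colors in JSON format"},
    ],
)

# Parse and validate the output with the same Pydantic class.
parsed = Color.model_validate_json(response.choices[0].message.content)
print(parsed.colors)  # e.g. ['cyan', 'magenta', 'yellow']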

Vision Language Models#

For multimodal models that can process both text and images:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Configure a vision model
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="pixtral-12b",
        model_source="mistral-community/pixtral-12b",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    accelerator_type="L40S",
    engine_kwargs=dict(
        tensor_parallel_size=1,
        max_model_len=8192,
    ),
)

# Build and deploy the model
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)
from openai import OpenAI

# Initialize client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Create and send a request with an image
response = client.chat.completions.create(
    model="pixtral-12b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image.jpg"},
                },
            ],
        }
    ],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
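If the image lives on disk rather than at a public URL, you can usually send it inline as a base64 data URL. A sketch assuming a hypothetical local file cat.jpg, and that the served model accepts data URLs (vLLM's OpenAI-compatible layer generally does):

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

# Read a local image (hypothetical path) and encode it as a data URL.
with open("cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{image_b64}"

response = client.chat.completions.create(
    model="pixtral-12b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ],
)
print(response.choices[0].message.content)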

Using remote storage for model weights#

You can use remote storage (S3 and GCS) to store your model weights instead of downloading them from Hugging Face.

For example, if you have a model stored in S3 with the following structure:

$ aws s3 ls air-example-data/rayllm-ossci/meta-Llama-3.2-1B-Instruct/
2025-03-25 11:37:48       1519 .gitattributes
2025-03-25 11:37:48       7712 LICENSE.txt
2025-03-25 11:37:48      41742 README.md
2025-03-25 11:37:48       6021 USE_POLICY.md
2025-03-25 11:37:48        877 config.json
2025-03-25 11:37:48        189 generation_config.json
2025-03-25 11:37:48 2471645608 model.safetensors
2025-03-25 11:37:53        296 special_tokens_map.json
2025-03-25 11:37:53    9085657 tokenizer.json
2025-03-25 11:37:53      54528 tokenizer_config.json

You can then specify the bucket_uri in the model_loading_config to point to your S3 bucket.

# config.yaml
applications:
- args:
    llm_configs:
    - accelerator_type: A10G
      engine_kwargs:
        max_model_len: 8192
      model_loading_config:
        model_id: my_llama
        model_source:
          bucket_uri: s3://anonymous@air-example-data/rayllm-ossci/meta-Llama-3.2-1B-Instruct
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"
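The same setup can be expressed with the Python builder API. The sketch below mirrors the YAML above and assumes model_source accepts a mapping with a bucket_uri key, as shown there; treat the exact field shape as an assumption and check the LLMConfig reference if loading fails:

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Load weights from S3 instead of Hugging Face (mirrors the YAML config above).
llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my_llama",
        model_source=dict(
            bucket_uri="s3://anonymous@air-example-data/rayllm-ossci/meta-Llama-3.2-1B-Instruct",
        ),
    ),
    accelerator_type="A10G",
    engine_kwargs=dict(
        max_model_len=8192,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app, blocking=True)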

Frequently Asked Questions#

How do I use gated Hugging Face models?#

You can use runtime_env to specify the environment variables that are required to access the model. To set the deployment options, you can use the get_serve_options method on the LLMConfig object.

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    runtime_env=dict(env_vars=dict(HF_TOKEN=os.environ["HF_TOKEN"])),
)

# Deploy the application
deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app, blocking=True)

Why is downloading the model so slow?#

If you are using Hugging Face models, you can enable fast download by setting HF_HUB_ENABLE_HF_TRANSFER=1 and installing hf_transfer (pip install hf_transfer).

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="llama-3-8b-instruct",
        model_source="meta-llama/Meta-Llama-3-8B-Instruct",
    ),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=2,
        )
    ),
    # Pass the desired accelerator type (e.g. A10G, L4, etc.)
    accelerator_type="A10G",
    runtime_env=dict(
        env_vars=dict(
            HF_TOKEN=os.environ["HF_TOKEN"],
            HF_HUB_ENABLE_HF_TRANSFER="1",
        )
    ),
)

# Deploy the application
deployment = LLMServer.as_deployment(
    llm_config.get_serve_options(name_prefix="vLLM:")
).bind(llm_config)
llm_app = LLMRouter.as_deployment().bind([deployment])
serve.run(llm_app, blocking=True)

How do I configure the tokenizer pool size so it doesn't hang?#

When using tokenizer_pool_size in vLLM's engine_kwargs, you also need to configure tokenizer_pool_extra_config so that the tokenizer group is scheduled correctly.

An example config is shown below:

# config.yaml
applications:
- args:
    llm_configs:
    - engine_kwargs:
        max_model_len: 1000
        tokenizer_pool_size: 2
        tokenizer_pool_extra_config: "{\"runtime_env\": {}}"
      model_loading_config:
        model_id: Qwen/Qwen2.5-7B-Instruct
  import_path: ray.serve.llm:build_openai_app
  name: llm_app
  route_prefix: "/"

Usage Data Collection#

We collect usage data to improve Ray Serve LLM, including data about the following features and attributes:

  • model architecture used for serving

  • whether JSON mode is used

  • whether LoRA is used and how many LoRA weights are loaded initially at deployment time

  • whether autoscaling is used and the min and max replicas setup

  • tensor parallel size used

  • initial replicas count

  • GPU type used and number of GPUs used

If you would like to opt out of usage data collection, you can follow Ray usage stats to disable it.

