- Notifications
You must be signed in to change notification settings - Fork28
Incubating P/D sidecar for llm-d
License
llm-d/llm-d-routing-sidecar
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
Warning
The code of the routing sidecar has been moved to thellm-d-inference-scheduler repository,under thecmd/pd_sidecar andpkg/sidecar directories. This repository is deprecated and shall soon be archived.All future development will done under the llm-d-inference-scheduler repository.Please donot submit issues or PRs against this repository.
This project provides a reverse proxy redirecting incoming requests to the prefill worker specified in thex-prefiller-host-port HTTP request header.
The sidecar includes SSRF (Server-Side Request Forgery) protection that can be enabled via feature flag. When enabled, it watches a specific InferencePool resource and maintains an allowlist of valid prefill targets based on pods matching the InferencePool selectors.
Note: The allowlist trackshosts/IPs only (not specific ports), so any port on an allowed host is permitted. Thex-prefiller-host-port header should be inhost:port format.
To enable SSRF protection:
# Using environment variables (recommended for Kubernetes deployments)export INFERENCE_POOL_NAMESPACE=your-namespaceexport INFERENCE_POOL_NAME=your-inference-pool./bin/llm-d-routing-sidecar -enable-ssrf-protection=true# Using command line flags (overrides environment variables)./bin/llm-d-routing-sidecar -enable-ssrf-protection=true -inference-pool-namespace=your-namespace -inference-pool-name=your-inference-pool
For automatic namespace injection in Kubernetes deployments, use the downward API:
apiVersion:apps/v1kind:Deploymentmetadata:name:llm-d-routing-sidecarspec:template:spec:containers: -name:sidecarimage:quay.io/llm-d/llm-d-routing-sidecar:latestargs: -"-enable-ssrf-protection=true"env: -name:INFERENCE_POOL_NAMESPACEvalueFrom:fieldRef:fieldPath:metadata.namespace -name:INFERENCE_POOL_NAMEvalue:"my-inference-pool"# Set to your specific InferencePool name
When SSRF protection is enabled:
- Only prefill targets that matchhosts/IPs from pods in the specified InferencePool resource are allowed (any port on allowed hosts)
- Requests to unauthorized targets return HTTP 403 Forbidden
- The allowlist is automatically updated when pods are added/removed/updated
- When disabled (default), all targets are allowed for backward compatibility
- a container engine (docker, podman, etc...)
- a server with at least 2 GPUs
- Start two vLLM servers with P/D enabled via the NIXLConnector.
In one terminal, run this command to start the decoder:
$ podman run --network host --device nvidia.com/gpu=0 -v $HOME/models:/models \ -e UCX_TLS="cuda_ipc,cuda_copy,tcp" \ -e VLLM_NIXL_SIDE_CHANNEL_PORT=5555 \ -e VLLM_NIXL_SIDE_CHANNEL_HOST=localhost \ -e VLLM_LOGGING_LEVEL=DEBUG \ -e HF_HOME=/models ghcr.io/llm-d/llm-d:0.0.8 --model Qwen/Qwen3-0.6B \ --enforce-eager \ --port 8001 \ --kv-transfer-config='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'In another terminal, run this command to start the prefiller:
$ podman run --network host --device nvidia.com/gpu=1 -v $HOME/models:/models \ -e UCX_TLS="cuda_ipc,cuda_copy,tcp" \ -e VLLM_NIXL_SIDE_CHANNEL_PORT=5556 \ -e VLLM_NIXL_SIDE_CHANNEL_HOST=localhost \ -e VLLM_LOGGING_LEVEL=DEBUG \ -e HF_HOME=/models ghcr.io/llm-d/llm-d:0.0.8 --model Qwen/Qwen3-0.6B \ --enforce-eager \ --port 8002 \ --kv-transfer-config='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'- Clone and start the routing proxy.
In another terminal, clone this repository and build the routing proxy:
$ git clone https://github.com/llm-d/llm-d-routing-sidecar.git && \ cd llm-d-routing-sidecar && \ make buildIn the same terminal, start the routing proxy:
$ ./bin/llm-d-routing-sidecar -port=8000 -vllm-port=8001 -connector=nixlv2- Send a request.
Finally, in another terminal, send a chat completions request to the router proxy on port 8000 and tell it to use the prefiller on port 8002:
$ curl http://localhost:8000/v1/completions \ -H "Content-Type: application/json" \ -H "x-prefiller-host-port: http://localhost:8002" \ -d '{ "model": "Qwen/Qwen3-0.6B", "prompt": "Author-contribution statements and acknowledgements in research papers should state clearly and specifically whether, and to what extent, the authors used AI technologies such as ChatGPT in the preparation of their manuscript and analysis. They should also indicate which LLMs were used. This will alert editors and reviewers to scrutinize manuscripts more carefully for potential biases, inaccuracies and improper source crediting. Likewise, scientific journals should be transparent about their use of LLMs, for example when selecting submitted manuscripts. Mention the large language model based product mentioned in the paragraph above:" }'- Verification
Observe the request is processed by both the prefiller and then the decoder:
Prefiller logs (trimmed for clarity purpose):
...Received request cmpl-386a7566-31a0-11f0-b783-0200017aca02-0: prompt: ......NIXLConnector request_finished, request_status=7 ...Decoder logs (trimmed for clarity purpose):
...Received request cmpl-386a7566-31a0-11f0-b783-0200017aca02-0: prompt: ......DEBUG 05-15 15:21:06 [nixl_connector.py:646] start_load_kv for request cmpl-386a7566-31a0-11f0-b783-0200017aca02-0 from remote engine c53ac287-36cd-4b52-811f-16de4e4bd3a5. Num local_block_ids: 6. Num remote_block_ids: 6...Build the routing proxy from source:
$ make build
Check the build was successful by running it locally:
$ ./bin/llm-d-routing-sidecar -helpUsage of ./bin/llm-d-routing-sidecar: -add_dir_header If true, adds the file directory to the header of the log messages -alsologtostderr log to standard error as well as files (no effect when -logtostderr=true) -cert-path string The path to the certificate for secure proxy. The certificate and private key files are assumed to be named tls.crt and tls.key, respectively. If not set, and secureProxy is enabled, then a self-signed certificate is used (for testing). -connector string the P/D connector being used. Either nixl, nixlv2 or lmcache (default "nixl") -decoder-tls-insecure-skip-verify configures the proxy to skip TLS verification for requests to decoder -decoder-use-tls whether to use TLS when sending requests to the decoder -enable-ssrf-protection enable SSRF protection using InferencePool allowlisting -inference-pool-name string the specific InferencePool name to watch (defaults to INFERENCE_POOL_NAME env var) -inference-pool-namespace string the Kubernetes namespace to watch for InferencePool resources (defaults to INFERENCE_POOL_NAMESPACE env var) -log_backtrace_at value when logging hits line file:N, emit a stack trace -log_dir string If non-empty, write log files in this directory (no effect when -logtostderr=true) -log_file string If non-empty, use this log file (no effect when -logtostderr=true) -log_file_max_size uint Defines the maximum size a log file can grow to (no effect when -logtostderr=true). Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800) -logtostderr log to standard error instead of files (default true) -one_output If true, only write logs to their native severity level (vs also writing to each lower severity level; no effect when -logtostderr=true) -port string the port the sidecar is listening on (default "8000") -prefiller-tls-insecure-skip-verify configures the proxy to skip TLS verification for requests to prefiller -prefiller-use-tls whether to use TLS when sending requests to prefillers -secure-proxy Enables secure proxy. Defaults to true. (default true) -skip_headers If true, avoid header prefixes in the log messages -skip_log_headers If true, avoid headers when opening log files (no effect when -logtostderr=true) -stderrthreshold value logs at or above this threshold go to stderr when writing to files and stderr (no effect when -logtostderr=true or -alsologtostderr=true) (default 2) -v value number for the log level verbosity -vllm-port string the port vLLM is listening on (default "8001") -vmodule value comma-separated list of pattern=N settings for file-filtered loggingNote: lmcache and nixl connectors are deprecated. Use nixlv2
This project is licensed under the Apache License 2.0. See theLICENSE file for details.
About
Incubating P/D sidecar for llm-d
Topics
Resources
License
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Packages0
Uh oh!
There was an error while loading.Please reload this page.
Contributors13
Uh oh!
There was an error while loading.Please reload this page.