Achieve state of the art inference performance with modern accelerators on Kubernetes


llm-d/llm-d


Latest News 🔥

  • [2025-10] Our v0.3 release delivers Intel XPU and Google TPU support, TCP and RDMA over RoCE tested with disaggregation, new experimental predicted latency balancing for up to 3x better P90 latency on long prefill, DeepSeek Expert Parallel serving reaching 2.2k output tokens/s/GPU on H200 and 2.9k output tokens/s/GPU on B200, and integration with the Inference Gateway v1.0 GA release.
  • [2025-08] Read more about the intelligent inference scheduler, including a deep dive on how different balancing techniques are composed to improve throughput without overloading replicas.

📄 About

llm-d is a Kubernetes-native distributed inference serving stack providing well-lit paths for anyone to serve large generative AI models at scale. We help you achieve the fastest "time to state-of-the-art (SOTA) performance" for key OSS models across most hardware accelerators and infrastructure providers.

Our well-lit paths provide tested and benchmarked recipes and Helm charts to start serving quickly with best practices common to production deployments. They are extensible and customizable for the particulars of your models and use cases, using popular open source components like Kubernetes, Envoy proxy, NIXL, and vLLM. Our intent is to eliminate the heavy lifting common in tuning and deploying inference at scale.

We currently offer three tested and benchmarked paths to help deploy large models:

  1. Intelligent Inference Scheduling - Deploy vLLM behind the Inference Gateway (IGW) to decrease serving latency and increase throughput with predicted latency balancing (experimental), precise prefix-cache aware routing, and customizable scheduling policies.
  2. Prefill/Decode Disaggregation - Reduce time to first token (TTFT) and get more predictable time per output token (TPOT) by splitting inference into prefill servers handling prompts and decode servers handling responses, primarily on large models such as Llama-70B and when processing very long prompts.
  3. Wide Expert-Parallelism - Deploy very large Mixture-of-Experts (MoE) models like DeepSeek-R1 and significantly reduce end-to-end latency and increase throughput by scaling up with Data Parallelism and Expert Parallelism over fast accelerator networks.
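
To make the idea behind prefix-cache aware routing (path 1 above) concrete, here is a minimal Python sketch. It is not llm-d's or IGW's implementation: the `Replica` class, the `BLOCK_SIZE` value, and the `pick_replica` helper are hypothetical names invented for illustration. The sketch hashes a prompt in fixed-size token blocks, chains each hash with the one before it, and sends the request to the replica that already holds the longest matching prefix.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per cache block; hypothetical value chosen for illustration


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    """Hash each full block of tokens, chained with the hash of the prefix before it."""
    hashes: list[int] = []
    prev = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tuple(tokens[start:start + block_size])
        prev = hash((prev, block))  # chaining makes a hash depend on the whole prefix
        hashes.append(prev)
    return hashes


@dataclass
class Replica:
    """Hypothetical stand-in for a model server and the KV blocks it has cached."""
    name: str
    cached: set[int] = field(default_factory=set)

    def prefix_hit_blocks(self, hashes: list[int]) -> int:
        """Count leading blocks that are already cached, stopping at the first miss."""
        hits = 0
        for h in hashes:
            if h not in self.cached:
                break
            hits += 1
        return hits


def pick_replica(replicas: list[Replica], tokens: list[int]) -> Replica:
    """Route to the replica with the longest cached prefix; ties go to list order."""
    hashes = block_hashes(tokens)
    best = max(replicas, key=lambda r: r.prefix_hit_blocks(hashes))
    best.cached.update(hashes)  # the chosen replica will now hold these blocks
    return best
```

A real scheduler would combine this signal with load and latency signals rather than use it alone; otherwise a popular prefix could pile every request onto one replica.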

Hardware Support

llm-d directly tests and validates multiple accelerator types, including NVIDIA GPUs, AMD GPUs, Google TPUs, and Intel XPUs, and provides common operational patterns to improve production reliability.

See the accelerator docs for points of contact and more details about the accelerators, networks, and configurations tested, and our roadmap for what is coming next.

What is in scope for llm-d

llm-d currently targets improving the production serving experience around:

  • Online serving and online batch of generative models running in PyTorch or JAX
    • Large language models (LLMs) with 1 billion or more parameters
    • Using most or all of the capacity of one or more hardware accelerators
    • Running in throughput, latency, or multiple-objective configurations
  • On recent generation datacenter-class accelerators
    • NVIDIA A100 / L4 or newer
    • AMD MI250 or newer
    • Google TPU v5e or newer
    • Intel Data Center GPU Max (XPU/Ponte Vecchio) series or newer
  • With extremely fast accelerator interconnect and datacenter networking
    • 600-16,000 Gbps per accelerator NVLink on host or across narrow domains like NVL72
    • 1,600-5,000 Gbps per chip TPU OCS links within TPU pods
    • 100-1,600 Gbps per host datacenter networking across broad (>128 host) domains
  • Kubernetes 1.29+ as the hardware orchestrator
    • in large (100-100k node) reserved cloud capacity or datacenters, often overlapping with AI batch and training
    • in medium (10-1k node) cloud deployments with a mix of reserved, on-demand, or spot capacity
    • in small (1-10 node) test and qualification environments with a static footprint, often time shared
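
The interconnect figures in the list above matter most when KV cache has to move between instances, as in prefill/decode disaggregation. The following back-of-the-envelope Python sketch estimates that transfer time. The model shape used in the usage note (80 layers, 8 KV heads, head dimension 128, 2 bytes per value) is an assumption for illustration, not a figure from llm-d, and the calculation assumes ideal line rate with no protocol overhead.

```python
def kv_cache_bytes(
    prompt_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit KV cache entries
) -> int:
    """Size of the KV cache for one prompt: a K and a V tensor per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * prompt_tokens


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal transfer time over a link rated in gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical model shape and prompt length, chosen only for illustration.
    size = kv_cache_bytes(prompt_tokens=32_000, num_layers=80, num_kv_heads=8, head_dim=128)
    for gbps in (100, 1_600):
        print(f"{gbps} Gbps: {transfer_seconds(size, gbps):.3f} s for {size / 1e9:.2f} GB")
```

Under these assumptions a 32,000-token prompt carries roughly 10 GB of KV cache, so the difference between the low and high ends of the datacenter networking range is the difference between a transfer that is visible in TTFT and one that is not.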

Our upstream projects – particularly vLLM and Kubernetes – support a broader array of models, accelerators, and networks that may also benefit from our work, but we concentrate on optimizing and standardizing the operational and automation challenges of the leading edge inference workloads.

🧱 Architecture

llm-d accelerates distributed inference by integrating industry-standard open technologies: vLLM as default model server and engine, Inference Gateway as request scheduler and balancer, and Kubernetes as infrastructure orchestrator and workload control plane.

[Figure: llm-d architecture diagram]

Key features of llm-d include:

  • vLLM-Optimized Inference Scheduler: llm-d builds on IGW's pattern of leveraging the Envoy proxy and its extensible balancing policies to make customizable “smart” load-balancing decisions specifically for LLMs. Leveraging operational telemetry, the Inference Scheduler implements the filtering and scoring algorithms to make decisions with P/D-, KV-cache-, SLA-, and load-awareness. Advanced users can implement their own scorers to further customize the algorithm while benefiting from IGW features like flow control and latency-aware balancing. See our Northstar design.

  • Disaggregated Serving with vLLM: llm-d orchestrates prefill and decode phases onto independent instances - the scheduler decides which instances should receive a given request, and the transaction is coordinated via a sidecar alongside decode instances. The sidecar instructs vLLM to provide point-to-point KV cache transfer over fast interconnects (IB/RoCE RDMA, TPU ICI, and DCN) via NIXL. See our Northstar design.

  • Disaggregated Prefix Caching: llm-d uses vLLM's KVConnector abstraction to configure a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache. We plan to support two KV caching schemes (see our Northstar design):

    • Independent (N/S) caching with offloading to local memory and disk, providing a zero operational cost mechanism for offloading.
    • Shared (E/W) caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
  • Variant Autoscaling over Hardware, Workload, and Traffic: A traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling the use of the Kubernetes Horizontal Pod Autoscaler (HPA) for SLO-level efficiency. See our Northstar design.
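
The filtering and scoring approach described for the Inference Scheduler above can be sketched in a few lines of Python. This is an illustration of the general filter-then-score pattern only. It does not use the llm-d or IGW plugin interfaces, and every name here (`Endpoint`, `not_saturated`, `load_scorer`, `prefix_scorer`, `schedule`) as well as the telemetry fields and weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical snapshot of per-replica telemetry."""
    name: str
    queue_depth: int             # requests waiting on this replica
    kv_cache_utilization: float  # 0.0 (empty) to 1.0 (full)
    prefix_hit_ratio: float      # fraction of this prompt already cached


Filter = Callable[[Endpoint], bool]
Scorer = Callable[[Endpoint], float]


def not_saturated(max_queue: int = 8) -> Filter:
    """Filter: drop replicas whose queue is at or above the limit."""
    return lambda e: e.queue_depth < max_queue


def load_scorer(e: Endpoint) -> float:
    """Scorer: prefer replicas with more free KV cache."""
    return 1.0 - e.kv_cache_utilization


def prefix_scorer(e: Endpoint) -> float:
    """Scorer: prefer replicas that already cache more of the prompt."""
    return e.prefix_hit_ratio


def schedule(
    endpoints: Sequence[Endpoint],
    filters: Sequence[Filter],
    weighted_scorers: Sequence[tuple[float, Scorer]],
) -> Endpoint:
    """Apply every filter, then pick the candidate with the highest weighted score."""
    if not endpoints:
        raise ValueError("no endpoints to schedule onto")
    candidates = [e for e in endpoints if all(f(e) for f in filters)]
    if not candidates:
        candidates = list(endpoints)  # fall back to all replicas instead of rejecting
    return max(candidates, key=lambda e: sum(w * s(e) for w, s in weighted_scorers))
```

A custom scorer in this sketch is just another function from `Endpoint` to a number, which mirrors the idea that advanced users can add their own scorers without changing the rest of the pipeline.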

For more, see the project proposal.

🚀 Getting Started

llm-d can be installed as a full solution, with the enabled features customized to your needs, or through its individual components for experimentation.

Pre-requisites

llm-d requires accelerators capable of running large models. Our well-lit paths are focused on datacenter accelerators and networks, and issues encountered outside these may not receive the same level of attention.

See the prerequisites for our guides for more details on supported hardware, networking, Kubernetes cluster configuration, and client tooling.

Deploying llm-d

llm-d provides Helm charts that deploy the inference scheduler and a parameterized deployment of vLLM that demonstrates a number of different production configurations.

We bundle these with guides to our well-lit paths with key decisions, tradeoffs, benchmarks, and recommended configuration.

We suggest the inference scheduling well-lit path if you need a simple, production-ready deployment of vLLM with optimized load balancing.

Tip

For a more in-depth introduction to llm-d, try our step-by-step quickstart.

Experimenting and developing with llm-d

llm-d is composed of multiple component repositories and derives from both vLLM and Inference Gateway upstreams. Please see the individual repositories for more guidance on development.

📦 Releases

Our guides are living docs and kept current. For details about the Helm charts and component releases, visit our GitHub Releases page to review release notes.

Check out our roadmap for upcoming releases.

Contribute

  • See our project overview for more details on our development process and governance.
  • Review our contributing guidelines for detailed information on how to contribute to the project.
  • Join one of our Special Interest Groups (SIGs) to contribute to specific areas of the project and collaborate with domain experts.
  • We use Slack to discuss development across organizations. Please join: Slack
  • We host a weekly standup for contributors on Wednesdays at 12:30 PM ET, as well as meetings for various SIGs. You can find them in the shared llm-d calendar.
  • We use Google Groups to share architecture diagrams and other content. Please join: Google Group

License

This project is licensed under Apache License 2.0. See the LICENSE file for details.
