Distributed KV cache coordinator

llm-d/llm-d-kv-cache-manager


KV-Cache Manager

Introduction

Efficiently caching Key & Value (KV) tensors is crucial for optimizing LLM inference. Reusing the KV-Cache, rather than recomputing it, significantly improves both Time To First Token (TTFT) and overall throughput, while also maximizing system resource utilization. As a distributed LLM inference platform, llm-d provides a comprehensive suite of KV-Cache management capabilities to achieve these goals.

This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.

Project Northstar

See the Project Northstar document for a detailed overview of the project's goals and vision.


KV-Cache Indexer Overview

The major component of this project is the KV-Cache Indexer, a high-performance library that maintains a global, near-real-time view of KV-Cache block locality across a fleet of vLLM pods.

It is powered by KVEvents streamed from vLLM, which provide structured metadata as KV-blocks are created or evicted from a vLLM instance's KV-cache. This allows the indexer to track which blocks reside on which nodes and on which tier (e.g., GPU or CPU). This metadata is the foundation for intelligent routing, enabling schedulers to make optimal, KV-cache-aware placement decisions.

The diagram below shows the primary data flows: the Read Path (scoring) and the Write Path (event ingestion).

```mermaid
graph TD
    subgraph "Inference Scheduler"
        A[Scheduler]
        subgraph "KV-Cache Manager"
            B[`kvcache.Indexer`]
            C[`kvblock.Index`]
            D[`kvevents.Pool`]
        end
    end
    subgraph "vLLM Fleet"
        E[vLLM Pod 1]
        F[vLLM Pod 2]
        G[...]
    end
    A--"1: Score(prompt, pods)"-->B
    B--"2: Query Index"-->C
    B--"3: Return Scores"-->A
    E--"A: Emit KVEvents"-->D
    F--"A: Emit KVEvents"-->D
    D--"B: Update Index"-->C
```

Read Path:

  • 1: Scoring Request: A scheduler asks the KVCache Indexer to score a set of pods for a given prompt
  • 2: Index Query: The indexer calculates the necessary KV-block keys from the prompt and queries the KV-Block Index to see which pods have those blocks
  • 3: Return Scores: The indexer returns a map of pods and their corresponding KV-cache-hit scores to the scheduler

Write Path:

  • A: Event Ingestion: As vLLM pods create or evict KV-blocks, they emit KVEvents containing metadata about these changes
  • B: Index Update: The Event Subscriber consumes these events and updates the KV-Block Index in near-real-time

For a more detailed breakdown, please see the high-level Architecture and the Configuration docs.


Examples

  • KVCache Indexer: A reference implementation showing how to run and use the kvcache.Indexer module
  • KVCache Aware Scorer: A reference implementation of how to integrate the kvcache.Indexer into a scheduler like the llm-d-inference-scheduler
  • KV-Events: Demonstrates how the KV-Cache Manager handles KV-Events through both an offline example with a dummy ZMQ publisher and an online example using a vLLM Helm chart
  • Valkey Backend: Shows how to configure and use Valkey as the backend for KV-block indexing, including RDMA support for high-performance scenarios
