
LMCache/LMCache


| Blog | Documentation | Join Slack | Interest Form | Official Email |

💡 What is LMCache?

TL;DR - Redis for LLMs.

LMCache is an LLM serving engine extension to reduce TTFT and increase throughput, especially in long-context scenarios. By storing the KV caches of reusable texts across various locations (GPU, CPU DRAM, local disk), LMCache reuses the KV caches of any reused text (not necessarily a prefix) in any serving engine instance. Thus, LMCache saves precious GPU cycles and reduces response delay for users.
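To make the reuse idea concrete, here is a toy sketch in Python. It is not LMCache's implementation: the class `ToyChunkCache`, its fixed-size chunking, and the content-hash keys are all assumptions made for illustration, and the stored value is a placeholder string rather than real KV tensors. It also ignores that real KV values depend on the preceding context. It only shows how a store keyed by chunk content can hit on text that reappears somewhere other than the start of a prompt.

```python
import hashlib


class ToyChunkCache:
    """Toy stand-in for a KV cache store keyed by chunk content.

    Hypothetical illustration only; not LMCache's real design.
    """

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.store = {}  # content hash -> placeholder for cached KV data

    def _chunks(self, tokens):
        # Split the token list into fixed-size chunks.
        for i in range(0, len(tokens), self.chunk_size):
            yield tuple(tokens[i:i + self.chunk_size])

    def _key(self, chunk):
        # Key depends only on the chunk's content, not on its position.
        return hashlib.sha256(" ".join(chunk).encode("utf-8")).hexdigest()

    def process(self, tokens):
        """Return (hits, misses) and store any chunk not seen before."""
        hits = misses = 0
        for chunk in self._chunks(tokens):
            key = self._key(chunk)
            if key in self.store:
                hits += 1
            else:
                self.store[key] = f"kv-for-{len(chunk)}-tokens"
                misses += 1
        return hits, misses
```

In this toy, a document that was processed once is found again even when new text is placed in front of it, as long as the chunk boundaries line up. Handling misaligned chunks and context-dependent KV values is outside the scope of the sketch.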

By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.

Try LMCache with pre-built vLLM docker images here.

🚀 Performance snapshot

[Image: LMCache performance snapshot]

💻 Quickstart

LMCache integrates with the latest vLLM (0.6.2). To install LMCache, use the following command:

# requires python >= 3.10 and nvcc >= 12.1
pip install lmcache lmcache_vllm

LMCache has the same interface as vLLM (both online serving and offline inference). To use online serving, you can start an OpenAI API-compatible vLLM server with LMCache via:

lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8
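Once the server is running, it can be queried like any OpenAI-compatible endpoint. The sketch below uses only the Python standard library. It assumes the server listens on `http://localhost:8000` (vLLM's default port; adjust if you pass `--port`) and exposes the usual `/v1/completions` route. The helper names `build_completion_request` and `complete` are made up for this example.

```python
import json
import urllib.request


def build_completion_request(prompt,
                             base_url="http://localhost:8000",
                             model="lmsys/longchat-7b-16k",
                             max_tokens=64):
    """Build a POST request for the OpenAI-style completions route.

    base_url is an assumption; change it to match your server.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Summarize the document above.")` needs a live server. `build_completion_request` can be inspected without one.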

To use vLLM's offline inference with LMCache, simply prefix the vLLM imports with lmcache_vllm. For example:

import lmcache_vllm.vllm as vllm
from lmcache_vllm.vllm import LLM
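A slightly fuller offline sketch follows. It assumes that `SamplingParams` can be imported from `lmcache_vllm.vllm` in the same way as `LLM`, and that the rest of the vLLM offline API (`LLM(...)`, `generate`) is unchanged. The helpers `build_prompts` and `run_offline` are hypothetical names. The prompts share one long context, which is the case where cached KV data is most likely to be reused.

```python
def build_prompts(shared_context, questions):
    """Put the same long context in front of every question."""
    return [f"{shared_context}\n\nQuestion: {q}\nAnswer:" for q in questions]


def run_offline(prompts, model="lmsys/longchat-7b-16k"):
    """Generate completions with vLLM through the lmcache_vllm wrapper.

    The import is done lazily so build_prompts works without a GPU.
    Assumption: SamplingParams is re-exported by lmcache_vllm.vllm.
    """
    from lmcache_vllm.vllm import LLM, SamplingParams

    llm = LLM(model=model, gpu_memory_utilization=0.8)
    params = SamplingParams(temperature=0.0, max_tokens=64)
    outputs = llm.generate(prompts, params)
    # Each result holds one or more candidate completions; take the first.
    return [out.outputs[0].text for out in outputs]
```

For example, `run_offline(build_prompts(long_document, ["Who wrote it?", "When?"]))` would run both questions against the same context on a machine with LMCache, vLLM, and a GPU available.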

More detailed documentation will be available soon.

- Sharing KV cache across multiple vLLM instances

LMCache supports sharing KV caches across different vLLM instances through the lmcache.server module. Here is a quick guide:

# Start lmcache server
lmcache_server localhost 65432

Then, start two vLLM instances with the LMCache config file:

wget https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/example.yaml

# start the first vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=0 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8000

# start the second vLLM instance
LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=1 lmcache_vllm serve lmsys/longchat-7b-16k --gpu-memory-utilization 0.8 --port 8001
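One informal way to check that the two instances share KV caches is to send the same long prompt to the first instance and then to the second, and compare how long each request takes. The sketch below only does the timing. The function names are hypothetical, and `send` is a caller-supplied function (an assumption of this example) that performs the actual HTTP request against an endpoint such as `http://localhost:8000`.

```python
import time


def time_call(send, endpoint, prompt):
    """Return the wall-clock seconds taken by send(endpoint, prompt)."""
    start = time.perf_counter()
    send(endpoint, prompt)
    return time.perf_counter() - start


def compare_instances(send, prompt,
                      endpoints=("http://localhost:8000",
                                 "http://localhost:8001")):
    """Send the same prompt to each endpoint in order and time each call."""
    return {endpoint: time_call(send, endpoint, prompt)
            for endpoint in endpoints}
```

If sharing works, the second endpoint may answer a long-context prompt noticeably faster than the first, since it can load the stored KV cache instead of recomputing it. Treat a single measurement as a rough indication only.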

- What's next

We also provide multiple docker-based demos at the 🔗 LMCache-demos repo. The demos cover the following use cases:

  • Share KV caches across multiple serving engines (🔗 link)
  • Loading non-prefix KV caches for RAG (🔗 link)

Interested in Connecting?

Fill out the interest form and our team will reach out to you! https://forms.gle/mQfQDUXbKfp2St1z7

🛣️ Incoming Milestones

  • First release of LMCache
  • Support installation through pip install and integrate with latest vLLM
  • Stable support for non-prefix KV caches
  • User and developer documentation

📖 Blogs and documentation

Our blog posts and documentation are available online.

Community meeting

Citation

If you use LMCache for your research, please cite our papers:

@inproceedings{liu2024cachegen,
  title={Cachegen: Kv cache compression and streaming for fast large language model serving},
  author={Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and others},
  booktitle={Proceedings of the ACM SIGCOMM 2024 Conference},
  pages={38--56},
  year={2024}
}

@article{cheng2024large,
  title={Do Large Language Models Need a Content Delivery Network?},
  author={Cheng, Yihua and Du, Kuntai and Yao, Jiayi and Jiang, Junchen},
  journal={arXiv preprint arXiv:2409.13761},
  year={2024}
}

@article{yao2024cacheblend,
  title={CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion},
  author={Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen},
  journal={arXiv preprint arXiv:2405.16444},
  year={2024}
}
