Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

License

NotificationsYou must be signed in to change notification settings

InternLM/lmdeploy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation


Latest News 🎉

2025
2024
  • [2024/11] Support Mono-InternVL with PyTorch engine
  • [2024/10] PyTorchEngine supports graph mode on ascend platform, doubling the inference speed
  • [2024/09] LMDeploy PyTorchEngine adds support forHuawei Ascend. See supported modelshere
  • [2024/09] LMDeploy PyTorchEngine achieves 1.3x faster on Llama3-8B inference by introducing CUDA graph
  • [2024/08] LMDeploy is integrated intomodelscope/swift as the default accelerator for VLMs inference
  • [2024/07] Support Llama3.1 8B, 70B and its TOOLS CALLING
  • [2024/07] SupportInternVL2 full-series models,InternLM-XComposer2.5 andfunction call of InternLM2.5
  • [2024/06] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
  • [2024/05] Balance vision model when deploying VLMs with multiple GPUs
  • [2024/05] Support 4-bits weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVa, InternLMXComposer2
  • [2024/04] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, InternLMXComposer2.
  • [2024/04] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Referhere for detailed guide
  • [2024/04] TurboMind latest upgrade boosts GQA, rocketing theinternlm2-20b model inference to 16+ RPS, about 1.8x faster than vLLM.
  • [2024/04] Support Qwen1.5-MOE and dbrx.
  • [2024/03] Support DeepSeek-VL offline inference pipeline and serving.
  • [2024/03] Support VLM offline inference pipeline and serving.
  • [2024/02] Support Qwen 1.5, Gemma, Mistral, Mixtral, Deepseek-MOE and so on.
  • [2024/01]OpenAOE seamless integration withLMDeploy Serving Service.
  • [2024/01] Support for multi-model, multi-machine, multi-card inference services. For usage instructions, please refer tohere
  • [2024/01] SupportPyTorch inference engine, developed entirely in Python, helping to lower the barriers for developers and enable rapid experimentation with new features and technologies.
2023
  • [2023/12] Turbomind supports multimodal input.
  • [2023/11] Turbomind supports loading hf model directly. Clickhere for details.
  • [2023/11] TurboMind major upgrades, including: Paged Attention, faster attention kernels without sequence length limitation, 2x faster KV8 kernels, Split-K decoding (Flash Decoding), and W4A16 inference for sm_75
  • [2023/09] TurboMind supports Qwen-14B
  • [2023/09] TurboMind supports InternLM-20B
  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat / instruct, and python specialist. Clickhere for deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports flash-attention2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation. Checkthis guide for detailed info
  • [2023/08] LMDeploy has launched on theHuggingFace Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using theAWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by theMMRazor andMMDeploy teams. It has the following core features:

  • Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM, by introducing key features like persistent batch(a.k.a. continuous batching), blocked KV cache, dynamic split&fuse, tensor parallelism, high-performance CUDA kernels and so on.

  • Effective Quantization: LMDeploy supports weight-only and k/v quantization, and the 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.

  • Effortless Distribution Server: Leveraging the request distribution service, LMDeploy facilitates an easy and efficient deployment of multi-model services across multiple machines and cards.

  • Interactive Inference Mode: By caching the k/v of attention during multi-round dialogue processes, the engine remembers dialogue history, thus avoiding repetitive processing of historical sessions.

  • Excellent Compatibility: LMDeploy supportsKV Cache Quant,AWQ andAutomatic Prefix Caching to be used simultaneously.

Performance

v0 1 0-benchmark

Supported Models

LLMsVLMs
  • Llama (7B - 65B)
  • Llama2 (7B - 70B)
  • Llama3 (8B, 70B)
  • Llama3.1 (8B, 70B)
  • Llama3.2 (1B, 3B)
  • InternLM (7B - 20B)
  • InternLM2 (7B - 20B)
  • InternLM3 (8B)
  • InternLM2.5 (7B)
  • Qwen (1.8B - 72B)
  • Qwen1.5 (0.5B - 110B)
  • Qwen1.5 - MoE (0.5B - 72B)
  • Qwen2 (0.5B - 72B)
  • Qwen2-MoE (57BA14B)
  • Qwen2.5 (0.5B - 32B)
  • Baichuan (7B)
  • Baichuan2 (7B-13B)
  • Code Llama (7B - 34B)
  • ChatGLM2 (6B)
  • GLM4 (9B)
  • CodeGeeX4 (9B)
  • Falcon (7B - 180B)
  • YI (6B-34B)
  • Mistral (7B)
  • DeepSeek-MoE (16B)
  • DeepSeek-V2 (16B, 236B)
  • DeepSeek-V2.5 (236B)
  • Mixtral (8x7B, 8x22B)
  • Gemma (2B - 7B)
  • Dbrx (132B)
  • StarCoder2 (3B - 15B)
  • Phi-3-mini (3.8B)
  • Phi-3.5-mini (3.8B)
  • Phi-3.5-MoE (16x3.8B)
  • MiniCPM3 (4B)
  • LLaVA(1.5,1.6) (7B-34B)
  • InternLM-XComposer2 (7B, 4khd-7B)
  • InternLM-XComposer2.5 (7B)
  • Qwen-VL (7B)
  • Qwen2-VL (2B, 7B, 72B)
  • Qwen2.5-VL (3B, 7B, 72B)
  • DeepSeek-VL (7B)
  • DeepSeek-VL2 (3B, 16B, 27B)
  • InternVL-Chat (v1.1-v1.5)
  • InternVL2 (1B-76B)
  • InternVL2.5(MPO) (1B-78B)
  • Mono-InternVL (2B)
  • ChemVLM (8B-26B)
  • MiniGeminiLlama (7B)
  • CogVLM-Chat (17B)
  • CogVLM2-Chat (19B)
  • MiniCPM-Llama3-V-2_5
  • MiniCPM-V-2_6
  • Phi-3-vision (4.2B)
  • Phi-3.5-vision (4.2B)
  • GLM-4V (9B)
  • Llama3.2-vision (11B, 90B)
  • Molmo (7B-D,72B)
  • Gemma3 (1B - 27B)

LMDeploy has developed two inference engines -TurboMind andPyTorch, each with a different focus. The former strives for ultimate optimization of inference performance, while the latter, developed purely in Python, aims to decrease the barriers for developers.

They differ in the types of supported models and the inference data type. Please refer tothis table for each engine's capability and choose the proper one that best fits your actual needs.

Quick StartOpen In Colab

Installation

It is recommended installing lmdeploy using pip in a conda environment (python 3.8 - 3.12):

conda create -n lmdeploy python=3.8 -yconda activate lmdeploypip install lmdeploy

The default prebuilt package is compiled onCUDA 12 since v0.3.0.For more information on installing on CUDA 11+ platform, or for instructions on building from source, please refer to theinstallation guide.

Offline Batch Inference

importlmdeploywithlmdeploy.pipeline("internlm/internlm3-8b-instruct")aspipe:response=pipe(["Hi, pls intro yourself","Shanghai is"])print(response)

Note

By default, LMDeploy downloads model from HuggingFace. If you would like to use models from ModelScope, please install ModelScope bypip install modelscope and set the environment variable:

export LMDEPLOY_USE_MODELSCOPE=True

If you would like to use models from openMind Hub, please install openMind Hub bypip install openmind_hub and set the environment variable:

export LMDEPLOY_USE_OPENMIND_HUB=True

For more information about inference pipeline, please refer tohere.

Tutorials

Please reviewgetting_started section for the basic usage of LMDeploy.

For detailed user guides and advanced guides, please refer to ourtutorials:

Third-party projects

  • Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy:LMDeploy-Jetson

  • Example project for deploying LLMs using LMDeploy and BentoML:BentoLMDeploy

Contributing

We appreciate all contributions to LMDeploy. Please refer toCONTRIBUTING.md for the contributing guideline.

Acknowledgement

Citation

@misc{2023lmdeploy,title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},author={LMDeploy Contributors},howpublished ={\url{https://github.com/InternLM/lmdeploy}},year={2023}}

License

This project is released under theApache 2.0 license.


[8]ページ先頭

©2009-2025 Movatter.jp