LMDeploy is a toolkit for compressing, deploying, and serving large language models, developed by the MMRazor and MMDeploy teams. This project focuses on providing efficient inference, deployment, and serving solutions for large language models (LLMs) and vision-language models (VLMs).
LMDeploy achieves 1.8x higher request throughput than vLLM by introducing key features such as continuous batching, paged KV cache, dynamic split fuse, tensor parallelism, and high-performance CUDA kernels.
LMDeploy supports weight quantization and k/v quantization, achieving 2.4x higher performance than FP16 with 4-bit inference. Quantization quality has been confirmed through OpenCompass evaluation.
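One contributor to the 4-bit speedup is the smaller weight footprint, which cuts the memory traffic of inference. A back-of-the-envelope calculation (the 7B parameter count is illustrative, and it covers weights only, not the KV cache):

```python
def weight_bytes(n_params: int, bits_per_weight: int) -> int:
    """Approximate memory needed to hold the model weights alone."""
    return n_params * bits_per_weight // 8

n = 7_000_000_000  # a hypothetical 7B-parameter model
fp16 = weight_bytes(n, 16)
w4 = weight_bytes(n, 4)
print(f"FP16: {fp16/1e9:.1f} GB, 4-bit: {w4/1e9:.1f} GB, ratio: {fp16/w4:.0f}x")
```

Four times less data to read per forward pass is why 4-bit inference can outpace FP16 even before kernel-level optimizations are counted.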
Leveraging its request-dispatching service, LMDeploy makes it easy to deploy multi-model services efficiently across multiple machines and GPUs.
By caching attention k/v during multi-turn conversations, the engine remembers the conversation history, avoiding redundant processing of historical sessions.
LMDeploy supports simultaneous use of KV Cache quantization, AWQ, and automatic prefix caching.
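A sketch of combining the three features in one engine configuration (the model path is illustrative and should point to an AWQ-quantized checkpoint, e.g. one produced by the `lmdeploy lite auto_awq` tool):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Illustrative: AWQ 4-bit weights, int8 KV cache, and automatic
# prefix caching enabled together in a single TurboMind config.
engine_config = TurbomindEngineConfig(
    model_format='awq',          # load 4-bit AWQ-quantized weights
    quant_policy=8,              # int8 KV cache (4 selects int4)
    enable_prefix_caching=True,  # reuse cached KV blocks for shared prefixes
)
pipe = pipeline('internlm/internlm2_5-7b-chat-4bit',
                backend_config=engine_config)
response = pipe(['Hi, pls intro yourself'])
print(response)
```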
LMDeploy has developed two inference engines: TurboMind and PyTorch. The two engines differ in the model types and inference data types they support, so users can choose whichever fits their actual needs.
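The engine is selected through the `backend_config` argument of `pipeline`; passing a `TurbomindEngineConfig` picks the TurboMind engine, and a `PytorchEngineConfig` picks the PyTorch one. A sketch (model name illustrative):

```python
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind engine: high-performance CUDA kernels.
pipe_turbomind = pipeline('internlm/internlm3-8b-instruct',
                          backend_config=TurbomindEngineConfig(tp=1))

# PyTorch engine: broader model and dtype coverage.
pipe_pytorch = pipeline('internlm/internlm3-8b-instruct',
                        backend_config=PytorchEngineConfig(tp=1))
```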
LMDeploy supports a wide range of model types, covering mainstream LLMs and VLMs.
It is recommended to install LMDeploy with pip inside a conda environment (Python 3.8-3.12 are supported):

```shell
conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy
```
Offline batch inference with the high-level `pipeline` API:

```python
import lmdeploy

with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
```
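Sampling behavior can be tuned per call through `GenerationConfig` (the values below are illustrative, not recommended defaults):

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm3-8b-instruct')
# Illustrative sampling parameters passed alongside the prompts.
response = pipe(['Hi, pls intro yourself'],
                gen_config=GenerationConfig(max_new_tokens=256,
                                            temperature=0.7,
                                            top_p=0.9))
print(response)
```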
Vision-language inference works the same way; a prompt is paired with an image:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('path/to/image.jpg')
response = pipe(('Describe this image', image))
print(response)
```
LMDeploy supports multiple model hubs. To download models from ModelScope or openMind Hub instead of Hugging Face, set the corresponding environment variable:

```shell
# download models from ModelScope
export LMDEPLOY_USE_MODELSCOPE=True
# or, download models from openMind Hub
export LMDEPLOY_USE_OPENMIND_HUB=True
```
LMDeploy is also deeply integrated with several open-source projects in the surrounding ecosystem.
LMDeploy is a powerful, high-performance toolkit for deploying large language models, suitable for scenarios ranging from R&D experiments to production deployments. Its dual-engine architecture, advanced quantization techniques, and broad model support make it a strong choice for AI application developers. Whether the goal is peak performance in production or rapid iteration in research, LMDeploy can provide a fitting solution.