License: Apache-2.0 | Language: Python | Stars: 6.6k | Organization: InternLM | Last Updated: 2025-06-19

LMDeploy Project Detailed Introduction

Project Overview

LMDeploy is a toolkit for compressing, deploying, and serving large language models, developed by the MMRazor and MMDeploy teams. This project focuses on providing efficient inference, deployment, and serving solutions for large language models (LLMs) and vision-language models (VLMs).

Core Features

1. Efficient Inference

LMDeploy delivers up to 1.8x higher request throughput than vLLM, thanks to key features such as continuous batching, paged KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels.
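
Several of these features are exposed through the engine configuration passed to the pipeline API. The following is a minimal sketch of enabling tensor parallelism across two GPUs; the model name and tp value are illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig

# Shard the model across 2 GPUs with tensor parallelism (example value)
backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
print(pipe(['Hi, pls intro yourself']))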

2. Effective Quantization

LMDeploy supports weight-only quantization and KV cache quantization; 4-bit inference delivers 2.4x the performance of FP16. Quantization quality has been verified through OpenCompass evaluation.
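
4-bit weights are produced offline with the lmdeploy lite auto_awq command, while KV cache quantization is switched on at load time through the engine configuration. A minimal sketch of enabling int8 KV cache quantization (the model name is illustrative):

from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables int8 KV cache quantization (4 selects int4)
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))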

3. Effortless Distribution Server

Leveraging its request distribution service, LMDeploy makes it easy to deploy multi-model services across multiple machines and GPUs.
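
A server launched with the lmdeploy serve api_server command exposes an OpenAI-compatible endpoint (port 23333 by default). The sketch below queries such a server with the openai client; the host, port, and prompt are placeholders:

from openai import OpenAI

# Point the OpenAI client at the local LMDeploy api_server
client = OpenAI(api_key='none', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
)
print(resp.choices[0].message.content)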

4. Interactive Inference Mode

By caching the attention k/v of multi-turn conversations, the engine remembers the dialogue history and avoids re-processing earlier turns.
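
With the pipeline API this looks roughly like the sketch below, where a session object carries the cached state between turns (the model name and prompts are illustrative):

from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
# The returned session keeps the cached conversation state
sess = pipe.chat('Give me a one-sentence introduction to Shanghai.')
print(sess.response.text)
# Reusing the session avoids re-processing the first turn
sess = pipe.chat('Now make it two sentences.', session=sess)
print(sess.response.text)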

5. Excellent Compatibility

LMDeploy supports using KV cache quantization, AWQ, and automatic prefix caching at the same time.
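
The combination can be expressed in a single engine configuration, sketched below with illustrative values and an assumed pre-quantized 4-bit AWQ checkpoint:

from lmdeploy import pipeline, TurbomindEngineConfig

# 4-bit AWQ weights + int8 KV cache + automatic prefix caching, enabled together
engine_config = TurbomindEngineConfig(
    model_format='awq',
    quant_policy=8,
    enable_prefix_caching=True,
)
pipe = pipeline('internlm/internlm2_5-7b-chat-4bit', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))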

Dual-Engine Architecture

LMDeploy has developed two inference engines:

TurboMind Engine

  • Focus: Pursuing ultimate optimization of inference performance
  • Features: Highly optimized C++/CUDA implementation, designed for production environments

PyTorch Engine

  • Focus: Pure Python development, lowering the barrier for developers
  • Features: Facilitates rapid experimentation with new features and technologies, easy to extend and customize

The two engines differ in the types of models and inference data types they support, allowing users to choose the appropriate engine based on their actual needs.
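
Selecting an engine is a matter of which backend configuration is passed to the pipeline. A minimal sketch (model name and session length are illustrative):

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind engine: highly optimized C++/CUDA backend
pipe_turbomind = pipeline('internlm/internlm2_5-7b-chat',
                          backend_config=TurbomindEngineConfig(session_len=8192))

# PyTorch engine: pure Python backend, easier to extend and customize
pipe_pytorch = pipeline('internlm/internlm2_5-7b-chat',
                        backend_config=PytorchEngineConfig(session_len=8192))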

Supported Models

LMDeploy supports a wide range of model types:

Large Language Models (LLMs)

  • InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3)
  • Llama series (Llama2, Llama3, Llama3.1)
  • Qwen series (Qwen1.5, Qwen1.5-MOE, etc.)
  • Baichuan2 series
  • Mistral, Mixtral
  • DeepSeek series
  • Gemma
  • Code Llama
  • More models are continuously being added

Vision-Language Models (VLMs)

  • InternVL series
  • InternLM-XComposer series
  • LLaVA series
  • CogVLM series
  • Mini-InternVL
  • DeepSeek-VL
  • More multimodal models

Installation

Quick Installation

It is recommended to install LMDeploy with pip in a conda environment (Python 3.8-3.12 is supported):

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

Notes

  • The default pre-built packages are compiled against CUDA 12 (since v0.3.0)
  • Installation on CUDA 11+ platforms is also supported
  • Building from source is supported

Quick Usage Examples

Basic Inference

import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
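
Sampling behavior can be adjusted through a generation config; the values below are examples:

import lmdeploy
from lmdeploy import GenerationConfig

# Example sampling parameters passed alongside the prompts
gen_config = GenerationConfig(top_p=0.8, temperature=0.7, max_new_tokens=256)
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
    print(response)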

Multimodal Inference

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('path/to/image.jpg')
response = pipe(('Describe this image', image))
print(response)

Model Source Support

LMDeploy supports multiple model hubs:

  1. HuggingFace (default)
  2. ModelScope: Set environment variable LMDEPLOY_USE_MODELSCOPE=True (see the sketch after this list)
  3. openMind Hub: Set environment variable LMDEPLOY_USE_OPENMIND_HUB=True
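
For example, switching to ModelScope amounts to setting the variable before LMDeploy resolves the model. A sketch, assuming the modelscope package is installed and using an illustrative model id:

import os
# Must be set before the pipeline downloads the model
os.environ['LMDEPLOY_USE_MODELSCOPE'] = 'True'

from lmdeploy import pipeline

pipe = pipeline('Shanghai_AI_Laboratory/internlm2_5-7b-chat')
print(pipe(['Hi, pls intro yourself']))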

Application Scenarios

  1. Production Environment Deployment: High-throughput LLM services
  2. R&D Experiments: Rapidly validate new models and algorithms
  3. Resource-Constrained Environments: Reduce resource requirements through quantization techniques
  4. Multimodal Applications: Efficient inference of vision-language models
  5. Edge Devices: Supports platforms such as NVIDIA Jetson

Ecosystem Integration

LMDeploy is deeply integrated with several open-source projects:

  • OpenAOE: Seamlessly integrates with LMDeploy's serving service
  • Swift: Uses LMDeploy as the default accelerator for VLM inference
  • BentoML: Provides deployment example projects
  • Jetson Platform: Dedicated adaptation for edge devices

Summary

LMDeploy is a powerful, high-performance toolkit for deploying large language models, suitable for scenarios ranging from R&D experiments to production deployments. Its dual-engine architecture, advanced quantization techniques, and broad model support make it an important choice for AI application developers. Whether for a production environment that demands peak performance or an R&D scenario that requires rapid iteration, LMDeploy provides a fitting solution.