License: Apache-2.0 | Language: Python | Stars: 6.6k | Organization: InternLM | Last Updated: 2025-06-19

LMDeploy Project Detailed Introduction

Project Overview

LMDeploy is a toolkit for compressing, deploying, and serving large language models, developed by the MMRazor and MMDeploy teams. This project focuses on providing efficient inference, deployment, and serving solutions for large language models (LLMs) and vision-language models (VLMs).

Core Features

1. Efficient Inference

LMDeploy delivers up to 1.8x higher request throughput than vLLM, thanks to key features such as continuous batching, paged KV cache, dynamic split-and-fuse, tensor parallelism, and high-performance CUDA kernels.
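
Several of these features are exposed through the engine configuration passed to the pipeline API. The following is a minimal sketch of enabling tensor parallelism across two GPUs; the model name and tp value are illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig

# Shard the model across 2 GPUs with tensor parallelism (example value)
backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
print(pipe(['Hi, pls intro yourself']))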

2. Effective Quantization

LMDeploy supports weight-only quantization and KV cache quantization; 4-bit inference delivers 2.4x the performance of FP16. Quantization quality has been verified through OpenCompass evaluation.
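
4-bit weights are produced offline with the lmdeploy lite auto_awq command, while KV cache quantization is switched on at load time through the engine configuration. A minimal sketch of enabling int8 KV cache quantization (the model name is illustrative):

from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 enables int8 KV cache quantization (4 selects int4)
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))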

3. Effortless Distribution Server

Leveraging its request distribution service, LMDeploy makes it easy to deploy multi-model services across multiple machines and GPUs.
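
A server launched with the lmdeploy serve api_server command exposes an OpenAI-compatible endpoint (port 23333 by default). The sketch below queries such a server with the openai client; the host, port, and prompt are placeholders:

from openai import OpenAI

# Point the OpenAI client at the local LMDeploy api_server
client = OpenAI(api_key='none', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Hi, pls intro yourself'}],
)
print(resp.choices[0].message.content)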

4. Interactive Inference Mode

By caching the attention k/v of multi-turn conversations, the engine remembers the dialogue history and avoids re-processing earlier turns.
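
With the pipeline API this looks roughly like the sketch below, where a session object carries the cached state between turns (the model name and prompts are illustrative):

from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
# The returned session keeps the cached conversation state
sess = pipe.chat('Give me a one-sentence introduction to Shanghai.')
print(sess.response.text)
# Reusing the session avoids re-processing the first turn
sess = pipe.chat('Now make it two sentences.', session=sess)
print(sess.response.text)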

5. Excellent Compatibility

LMDeploy supports using KV cache quantization, AWQ, and automatic prefix caching at the same time.
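
The combination can be expressed in a single engine configuration, sketched below with illustrative values and an assumed pre-quantized 4-bit AWQ checkpoint:

from lmdeploy import pipeline, TurbomindEngineConfig

# 4-bit AWQ weights + int8 KV cache + automatic prefix caching, enabled together
engine_config = TurbomindEngineConfig(
    model_format='awq',
    quant_policy=8,
    enable_prefix_caching=True,
)
pipe = pipeline('internlm/internlm2_5-7b-chat-4bit', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))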

Dual-Engine Architecture

LMDeploy has developed two inference engines:

TurboMind Engine

  • Focus: Pursuing ultimate optimization of inference performance
  • Features: Highly optimized C++/CUDA implementation, designed for production environments

PyTorch Engine

  • Focus: Pure Python development, lowering the barrier for developers
  • Features: Facilitates rapid experimentation with new features and technologies, easy to extend and customize

The two engines differ in the types of models and inference data types they support, allowing users to choose the appropriate engine based on their actual needs.
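
Selecting an engine is a matter of which backend configuration is passed to the pipeline. A minimal sketch (model name and session length are illustrative):

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind engine: highly optimized C++/CUDA backend
pipe_turbomind = pipeline('internlm/internlm2_5-7b-chat',
                          backend_config=TurbomindEngineConfig(session_len=8192))

# PyTorch engine: pure Python backend, easier to extend and customize
pipe_pytorch = pipeline('internlm/internlm2_5-7b-chat',
                        backend_config=PytorchEngineConfig(session_len=8192))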

Supported Models

LMDeploy supports a wide range of model types:

Large Language Models (LLMs)

  • InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3)
  • Llama series (Llama2, Llama3, Llama3.1)
  • Qwen series (Qwen1.5, Qwen1.5-MOE, etc.)
  • Baichuan2 series
  • Mistral, Mixtral
  • DeepSeek series
  • Gemma
  • Code Llama
  • More models are continuously being added

Vision-Language Models (VLMs)

  • InternVL series
  • InternLM-XComposer series
  • LLaVA series
  • CogVLM series
  • Mini-InternVL
  • DeepSeek-VL
  • More multimodal models

Installation

Quick Installation

It is recommended to install LMDeploy with pip in a conda environment (Python 3.8-3.12 is supported):

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

Notes

  • The default pre-built packages are compiled against CUDA 12 (since v0.3.0)
  • Installation on CUDA 11+ platforms is also supported
  • Building from source is supported

Quick Usage Examples

Basic Inference

import lmdeploy
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"])
    print(response)
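
Sampling behavior can be adjusted through a generation config; the values below are examples:

import lmdeploy
from lmdeploy import GenerationConfig

# Example sampling parameters passed alongside the prompts
gen_config = GenerationConfig(top_p=0.8, temperature=0.7, max_new_tokens=256)
with lmdeploy.pipeline("internlm/internlm3-8b-instruct") as pipe:
    response = pipe(["Hi, pls intro yourself", "Shanghai is"], gen_config=gen_config)
    print(response)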

Multimodal Inference

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL2-8B')
image = load_image('path/to/image.jpg')
response = pipe(('Describe this image', image))
print(response)

Model Source Support

LMDeploy supports multiple model hubs:

  1. HuggingFace (default)
  2. ModelScope: Set environment variable LMDEPLOY_USE_MODELSCOPE=True (see the sketch after this list)
  3. openMind Hub: Set environment variable LMDEPLOY_USE_OPENMIND_HUB=True
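
For example, switching to ModelScope amounts to setting the variable before LMDeploy resolves the model. A sketch, assuming the modelscope package is installed and using an illustrative model id:

import os
# Must be set before the pipeline downloads the model
os.environ['LMDEPLOY_USE_MODELSCOPE'] = 'True'

from lmdeploy import pipeline

pipe = pipeline('Shanghai_AI_Laboratory/internlm2_5-7b-chat')
print(pipe(['Hi, pls intro yourself']))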

Application Scenarios

  1. Production Environment Deployment: High-throughput LLM services
  2. R&D Experiments: Rapidly validate new models and algorithms
  3. Resource-Constrained Environments: Reduce resource requirements through quantization techniques
  4. Multimodal Applications: Efficient inference of vision-language models
  5. Edge Devices: Supports platforms such as NVIDIA Jetson

Ecosystem Integration

LMDeploy is deeply integrated with several open-source projects:

  • OpenAOE: Seamlessly integrates with LMDeploy's serving service
  • Swift: Uses LMDeploy as the default accelerator for VLM inference
  • BentoML: Provides deployment example projects
  • Jetson Platform: Dedicated adaptation for edge devices

Summary

LMDeploy is a powerful, high-performance toolkit for deploying large language models, suitable for scenarios ranging from R&D experiments to production deployments. Its dual-engine architecture, advanced quantization techniques, and broad model support make it an important choice for AI application developers. Whether for a production environment that demands peak performance or an R&D scenario that requires rapid iteration, LMDeploy provides a fitting solution.