
An easy-to-use and fast deep learning and large language model deployment toolkit that supports cloud, mobile, and edge deployment. It includes 20+ mainstream scenarios and 150+ SOTA models for image, video, text, and audio, with end-to-end optimization, multi-platform, and multi-framework support.

Apache-2.0 · CUDA · 3.2k stars · PaddlePaddle · Last Updated: 2025-06-16

FastDeploy Project Detailed Introduction

Project Overview

FastDeploy is an open-source deep learning model deployment toolkit from the PaddlePaddle team at Baidu, focusing on providing developers with easy-to-use, high-performance AI model deployment solutions. The project aims to lower the technical barriers to deploying deep learning models from training to production environments, supporting multiple platforms and model types.

Project Address: https://github.com/PaddlePaddle/FastDeploy

Key Features

🚀 Core Advantages

  • Easy to Use: Provides concise API interfaces, enabling model deployment with a single command.
  • High Performance: Deeply optimized for different hardware platforms, delivering ultimate inference performance.
  • Multi-Platform Support: Covers various deployment scenarios, including cloud, mobile, and edge.
  • Multi-Framework Compatibility: Supports mainstream deep learning frameworks such as PaddlePaddle, PyTorch, and TensorFlow.

🎯 Version Highlights

FastDeploy 2.0 Version Highlights

  • Large Language Model Support: Specifically optimized for large-model inference; currently supports the Qwen2 model, with more models being added continuously.
  • Service-Oriented Deployment: Quickly implement service-oriented model deployment with a single command, supporting streaming generation.
  • Tensor Parallelism Technology: Utilizes tensor parallelism to accelerate large model inference performance.
  • Advanced Features:
    • Supports PagedAttention and continuous batching.
    • Compatible with OpenAI's HTTP protocol.
    • Provides weight-only int8/int4 lossless compression schemes.
    • Supports Prometheus Metrics monitoring.
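Since FastDeploy states it is compatible with OpenAI's HTTP protocol and supports streaming generation, a client can consume its output as standard OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). The sketch below is a hedged illustration of parsing such a stream; the function name and the simulated chunk data are hypothetical, while the chunk schema follows the OpenAI streaming format the project claims compatibility with.

```python
import json

# Hypothetical helper: collects the text deltas from an OpenAI-style
# SSE stream. Each event line looks like `data: {...}`; the stream
# ends with `data: [DONE]`.
def collect_stream_text(sse_lines):
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Simulated stream for illustration (not actual FastDeploy output):
demo = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(demo))  # → Hello
```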

Supported Scenarios and Models

📱 Application Scenarios

  • Image Processing: Image classification, object detection, image segmentation, OCR recognition, etc.
  • Video Analysis: Action recognition, video understanding, real-time video processing, etc.
  • Natural Language Processing: Text classification, sentiment analysis, question answering systems, large language model inference, etc.
  • Speech Processing: Speech recognition, speech synthesis, speech analysis, etc.

🏆 Model Ecosystem

  • Supports 150+ SOTA models
  • Covers 20+ mainstream application scenarios
  • End-to-end optimized model deployment process

Technical Architecture

🔧 System Requirements

For large model deployment (2.0 version):

  • Hardware Requirements: A800/H800/H100 GPU
  • Software Environment:
    • Python >= 3.10
    • CUDA >= 12.3
    • cuDNN >= 9.5
    • Linux X64 operating system
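The requirements above can be partially pre-checked before installation. The sketch below is a minimal, hedged check using only the standard library; it can only see the Python version and OS/architecture, so CUDA and cuDNN versions must still be verified separately (e.g. with `nvidia-smi`). The function name is hypothetical.

```python
import platform
import sys

# Hypothetical pre-flight check mirroring the listed requirements:
# Python >= 3.10 on Linux x86_64. Returns a list of problems found
# (empty list means the basic checks passed).
def check_environment():
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python {platform.python_version()} < 3.10")
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        problems.append(
            f"{platform.system()}/{platform.machine()} is not Linux x86_64"
        )
    return problems

print(check_environment() or "basic checks passed")
```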

🛠️ Deployment Methods

  1. Docker Deployment: Provides pre-built Docker images.
  2. Source Code Compilation: Supports compilation and installation from source code.
  3. Python Package Installation: Directly install via pip.

Quick Start

Installation Methods

1. Docker Method

docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy:2.0.0.0-alpha

2. Source Code Compilation

# Install PaddlePaddle nightly version
python -m pip install --pre paddlepaddle-gpu -i https://www.paddlepaddle.org.cn/packages/nightly/cu126/

# Compile FastDeploy
cd FastDeploy
bash build.sh

# Install
pip install dist/fastdeploy-2.0.0a0-py3-none-any.whl

Quick Deployment Example

Qwen2 Model Deployment

# Download model
wget https://fastdeploy.bj.bcebos.com/llm/models/Qwen2-7B-Instruct.tar.gz && tar xvf Qwen2-7B-Instruct.tar.gz

# Start service
python -m fastdeploy.entrypoints.openai.api_server --model ./Qwen2-7B-Instruct --port 8188 --tensor-parallel-size 1

API Call Example

curl -X POST "http://0.0.0.0:8188/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "你好,你的名字是什么?"}
  ]
}'
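The same request can be made from Python using only the standard library. The sketch below builds the identical payload and request object; the helper name is hypothetical, and the URL/port simply mirror the curl example above.

```python
import json
import urllib.request

# Hypothetical helper: builds a POST request for the OpenAI-compatible
# chat endpoint started in the previous step (port 8188).
def build_chat_request(prompt, base_url="http://0.0.0.0:8188"):
    payload = {"messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Hello, what is your name?")
# With the service running, send it and print the reply:
#   resp = urllib.request.urlopen(req)
#   print(json.loads(resp.read())["choices"][0]["message"]["content"])
```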

Technical Features

🎛️ Advanced Features

  • Tensor Parallelism: Supports distributed inference of large models.
  • Dynamic Batching: Continuous batching technology improves throughput.
  • Memory Optimization: PagedAttention reduces memory footprint.
  • Model Compression: Weight-only quantization technology.

🔗 Protocol Compatibility

  • OpenAI Compatibility: Fully compatible with the OpenAI API protocol.
  • Multi-Language SDK: Supports multiple programming languages such as Python and C++.
  • Monitoring Integration: Built-in Prometheus metrics monitoring.
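For the Prometheus integration, metrics are exposed in the standard Prometheus text exposition format (lines of the form `metric_name{labels} value`). The sketch below parses a single line of that format; the parser and the sample metric name are hypothetical, since the specific metric names FastDeploy exports are not listed here.

```python
# Hypothetical parser for one line of Prometheus text exposition
# format: everything before the last space is the metric name (plus
# optional labels), the last token is the numeric sample value.
def parse_metric_line(line):
    name_part, value = line.rstrip().rsplit(" ", 1)
    return name_part, float(value)

# Sample line with a made-up metric name for illustration:
print(parse_metric_line("fastdeploy_requests_total 42"))
```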

Version Description

Current Version Strategy

  • FastDeploy 2.0: Focuses on large language model deployment.
  • FastDeploy 1.1.0: Continues to support traditional CV models (PaddleClas, PaddleOCR, etc.).

Summary

As an important part of the Baidu PaddlePaddle ecosystem, FastDeploy is committed to creating industry-leading AI model deployment solutions. Through continuous technological innovation and community building, it provides developers with a complete toolchain from model training to production deployment, promoting the popularization and application of AI technology.