PowerInfer - High-Speed Large Language Model Inference Engine
Project Overview
PowerInfer is a high-speed Large Language Model (LLM) inference engine developed by the IPADS Lab at Shanghai Jiao Tong University, designed specifically for personal computers equipped with a single consumer-grade GPU. Its core innovation is to exploit the high locality inherent in LLM inference: neuron activations follow a power-law distribution, and PowerInfer uses this skew to optimize inference performance.
Project Background
Traditional LLM inference faces significant computational and memory challenges, especially when deploying large models on consumer-grade hardware. PowerInfer addresses this by deeply analyzing neural network activation patterns, revealing a key insight: a small number of "hot" neurons are consistently activated across inputs, while the activation of the majority of "cold" neurons depends on the specific input.
Core Technical Principles
Hot-Cold Neuron Mechanism
PowerInfer's design is based on the following core observations (a sketch of how such a split can be derived follows the list):
- Hot Neurons: A small number of neurons that are consistently activated across inputs.
- Cold Neurons: The majority of neurons, whose activation depends on the specific input.
- Power-Law Distribution: Neuron activation follows a power-law distribution pattern.
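A minimal sketch of how such a hot/cold split could be derived offline, assuming per-neuron activation counts have been collected from a profiling run; the Pareto-sampled counts, the 20% cutoff, and all variable names are illustrative assumptions rather than PowerInfer's actual policy:

```python
import numpy as np

# Illustrative stand-in for offline profiling: per-neuron activation counts
# drawn from a heavy-tailed (power-law-like) distribution.
rng = np.random.default_rng(0)
activation_counts = rng.pareto(a=1.5, size=11008)

# Rank neurons by how often they fired; the most frequent ones are "hot".
order = np.argsort(activation_counts)[::-1]
hot_fraction = 0.2                      # assumed cutoff, not PowerInfer's value
n_hot = int(hot_fraction * len(order))

hot_neurons = order[:n_hot]             # consistently active -> GPU-resident
cold_neurons = order[n_hot:]            # input-dependent     -> computed on CPU

share = activation_counts[hot_neurons].sum() / activation_counts.sum()
print(f"{hot_fraction:.0%} of neurons account for {share:.0%} of activations")
```

Because the counts are heavy-tailed, a small hot set covers a large share of all activations, which is exactly the property the GPU-resident partition relies on.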
GPU-CPU Hybrid Architecture
Based on the characteristics of hot and cold neurons, PowerInfer employs an innovative hybrid inference strategy (sketched after the list):
- GPU: Pre-loads hot-activated neurons for fast access.
- CPU: Computes cold-activated neurons, significantly reducing GPU memory requirements.
- Intelligent Scheduling: Greatly reduces CPU-GPU data transfer overhead.
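The toy sketch below shows the shape of this split for a single FFN up-projection: hot rows are treated as GPU-resident and always computed, cold rows as CPU-resident and computed only when predicted active, after which the partial results are merged. Function and variable names are illustrative assumptions, not PowerInfer's internal API.

```python
import numpy as np

def hybrid_ffn_up(x, W_hot, W_cold, hot_idx, cold_idx, active_cold):
    """Toy GPU/CPU split of one FFN up-projection.

    W_hot / W_cold : weight rows for hot ("GPU") and cold ("CPU") neurons
    hot_idx / cold_idx : original neuron indices of those rows (NumPy arrays)
    active_cold : positions within cold_idx the predictor expects to fire
    """
    out = np.zeros(len(hot_idx) + len(cold_idx))
    out[hot_idx] = W_hot @ x                              # hot part: always computed
    out[cold_idx[active_cold]] = W_cold[active_cold] @ x  # cold part: sparse, on demand
    return np.maximum(out, 0.0)                           # ReLU keeps the output sparse

# Tiny example: 8 neurons total, first 3 are hot, predictor fires 2 cold neurons.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 4))
hot_idx, cold_idx = np.arange(3), np.arange(3, 8)
y = hybrid_ffn_up(rng.standard_normal(4), W[hot_idx], W[cold_idx],
                  hot_idx, cold_idx, active_cold=np.array([0, 2]))
```

Skipping the inactive cold rows is where the memory-traffic and compute savings come from; in practice the hot part runs as a dense GPU kernel while the cold part is handled by CPU-side sparse operators.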
Core Features
🚀 High-Performance Inference
- Speed Performance: Achieves an average token generation speed of 13.20 tokens/s, with peaks up to 29.08 tokens/s.
- Performance Comparison: Up to 11.69x performance improvement compared to llama.cpp.
- Hardware Efficiency: Performance on an RTX 4090 is only 18% lower than a server-grade A100 GPU.
🧠 Intelligent Optimization Technologies
- Adaptive Predictor: Dynamically optimizes neuron activation prediction.
- Neuron-Aware Sparse Operators: Compute only the neurons predicted to be active, making sparse computation efficient (see the sketch after this list).
- Locality-Centric Design: Fully leverages sparse activation characteristics.
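As a rough illustration of what a neuron-aware sparse operator buys over a dense one, the sketch below computes only the rows a predictor marked as active and checks that the result matches the dense matrix-vector product on those rows; the dimensions, names, and 10% activity rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 1024, 4096
W, x = rng.standard_normal((n, d)), rng.standard_normal(d)

# Suppose the predictor marks roughly 10% of neurons as active for this input.
active = rng.choice(n, size=n // 10, replace=False)

dense = W @ x                      # dense operator: touches all n rows
sparse = np.zeros(n)
sparse[active] = W[active] @ x     # neuron-aware operator: only the predicted rows

assert np.allclose(dense[active], sparse[active])
print(f"rows computed: {len(active)} / {n}")
```

The two results agree on the active rows, so as long as the predictor's recall is high, the sparse operator loses little accuracy while skipping roughly 90% of the row computations.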
🔧 Ease of Use and Compatibility
- Easy Integration: Compatible with popular ReLU sparse models.
- Local Deployment: Deeply optimized for consumer-grade hardware.
- Backward Compatibility: Supports most llama.cpp usage patterns.
Supported Models
Currently Supported Model Families
| Model Family | Parameter Size | Features |
| --- | --- | --- |
| Falcon Series | 40B | ReLU activation function optimization |
| Llama2 Series | 7B/13B/70B | Full series support |
| ProSparse Llama2 | 7B/13B | ~90% sparsity, performance close to the original |
| Bamboo Series | 7B | Combines top-tier performance with high speed |
Model Format
PowerInfer uses a specialized PowerInfer GGUF format, which includes:
- LLM weights
- Predictor weights
- Activation statistics
Technical Architecture
System Design
┌─────────────────────────┐     ┌─────────────────────────┐
│       Hot Neurons       │────▶│           GPU           │
│     (Always Active)     │     │      (Fast Access)      │
└─────────────────────────┘     └─────────────────────────┘
             │
             ▼
┌─────────────────────────┐     ┌─────────────────────────┐
│      Cold Neurons       │────▶│           CPU           │
│ (Conditionally Active)  │     │ (Flexible Computation)  │
└─────────────────────────┘     └─────────────────────────┘
Core Components
- Activation Predictor: Intelligently predicts neuron activation patterns.
- Memory Manager: Optimizes GPU/CPU memory allocation (a toy placement sketch follows this list).
- Sparse Operators: Efficiently handles sparse computations.
- Scheduler: Intelligently allocates computational tasks.
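A toy sketch of the kind of placement decision the memory manager has to make: given per-neuron activation frequencies, their weight sizes, and a VRAM budget, keep the most frequently used neurons on the GPU until the budget is spent. The greedy rule, the numbers, and all names are assumptions for illustration, not PowerInfer's actual algorithm.

```python
def plan_gpu_residency(freq, bytes_per_neuron, vram_budget_bytes):
    """Greedy toy planner: the most frequently activated neurons become
    GPU-resident until the VRAM budget is spent; the rest stay on the CPU."""
    order = sorted(range(len(freq)), key=lambda i: freq[i], reverse=True)
    gpu, cpu, used = [], [], 0
    for i in order:
        if used + bytes_per_neuron <= vram_budget_bytes:
            gpu.append(i)
            used += bytes_per_neuron
        else:
            cpu.append(i)
    return gpu, cpu

# Illustrative numbers: 11008 neurons, ~16 KiB of weights each, 64 MiB of spare VRAM.
gpu, cpu = plan_gpu_residency(freq=list(range(11008)),
                              bytes_per_neuron=16 * 1024,
                              vram_budget_bytes=64 * 1024**2)
print(len(gpu), "neurons on GPU,", len(cpu), "on CPU")   # 4096 on GPU, 6912 on CPU
```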
Platform Support
Tested Platforms
- Linux: x86-64 CPU with AVX2 instruction set, supports NVIDIA GPU.
- Windows: x86-64 CPU with AVX2 instruction set, supports NVIDIA GPU.
- macOS: Apple M-series chips (CPU only, limited performance improvement).
- AMD GPU: Supported via ROCm.
Hardware Requirements
- CPU: x86-64 processor supporting the AVX2 instruction set.
- GPU: NVIDIA RTX series or AMD GPU (optional).
- Memory: Depends on the model size.
- Storage: Sufficient space to store model files.
Performance Benchmarks
RTX 4090 Performance
| Model | PowerInfer | llama.cpp | Speedup |
| --- | --- | --- | --- |
| Falcon-40B | 11.2 tokens/s | 1.0 tokens/s | 11.2x |
| Llama2-70B | 8.1 tokens/s | 2.7 tokens/s | 3.0x |
| Llama2-13B | 24.8 tokens/s | 8.9 tokens/s | 2.8x |
RTX 2080Ti Performance (INT4 Quantization)
| Model | PowerInfer | llama.cpp | Speedup |
| --- | --- | --- | --- |
| Falcon-40B | 6.8 tokens/s | 0.85 tokens/s | 8.0x |
| Llama2-70B | 5.2 tokens/s | 1.7 tokens/s | 3.1x |
Installation and Usage
Environment Requirements
- CMake (3.17+)
- Python (3.8+) and pip (19.3+)
- CUDA toolchain (if using NVIDIA GPU)
Basic Installation
```bash
git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt

# NVIDIA GPU
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

# CPU only
cmake -S . -B build
cmake --build build --config Release
```
Model Download
```bash
# Download model using huggingface-cli
huggingface-cli download --resume-download --local-dir ReluLLaMA-7B \
  --local-dir-use-symlinks False PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF
```
Running Inference
```bash
# Basic inference
./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
  -n 128 -t 8 -p "Once upon a time"

# Limit VRAM usage
./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
  -n 128 -t 8 -p "Once upon a time" --vram-budget 8
```
Latest Updates and Developments
Technical Innovations
- PowerInfer-2: Mobile-optimized version, achieving 11.68 tokens/s on smartphones.
- TurboSparse: Low-cost sparsification technology, significantly reducing parameters while maintaining performance.
- Bamboo LLM: A self-developed model series that balances performance and speed.
Application Scenarios
Suitable Scenarios
- Personal AI Assistant: Deploying a private AI assistant locally.
- Enterprise Internal Applications: Internal AI services that protect data privacy.
- Research and Development: Rapid prototyping and model testing.
- Edge Computing: Deploying LLMs in resource-constrained environments.
- Educational Research: Learning and researching large model inference techniques.
Advantages and Features
- Privacy Protection: All computations are performed locally.
- Cost-Effectiveness: Excellent performance can be achieved using consumer-grade hardware.
- Simple Deployment: No complex distributed system configuration required.
- Fast Response: Low-latency local inference.
Technical Comparison
vs. Traditional Inference Engines
| Feature | PowerInfer | Traditional Engines |
| --- | --- | --- |
| Hardware Requirements | Consumer-grade GPU | Server-grade GPU |
| Memory Strategy | Hybrid CPU/GPU offloading | Full GPU loading |
| Inference Speed | Up to 11.69x improvement | Baseline |
| Cost | Low | High |
vs. llama.cpp
- Performance: Up to 11.69x speed improvement.
- Memory: More efficient memory utilization.
- Hardware: Better CPU/GPU coordination.
- Compatibility: Supports most llama.cpp features.
Technical Principles in Depth
Sparsity Utilization
The core of PowerInfer lies in the deep utilization of neural network sparsity:
- Activation Pattern Analysis: Discovering the power-law distribution of neuron activation through extensive data analysis.
- Prediction Mechanism: A lightweight predictor estimates which neurons will activate (a minimal sketch follows this list).
- Dynamic Scheduling: Dynamically allocating computing resources based on prediction results.
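A minimal sketch of what a lightweight per-layer activation predictor could look like, assuming a small low-rank MLP scores each neuron's probability of firing and a threshold selects which ones to compute; the rank, threshold, and names are illustrative assumptions, not the predictor PowerInfer actually trains.

```python
import numpy as np

class ToyActivationPredictor:
    """Low-rank MLP: hidden state (d) -> small bottleneck (r) -> per-neuron score (n)."""

    def __init__(self, d, n, r=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((r, d)) / np.sqrt(d)
        self.W2 = rng.standard_normal((n, r)) / np.sqrt(r)

    def predict_active(self, x, threshold=0.5):
        h = np.maximum(self.W1 @ x, 0.0)                 # bottleneck + ReLU
        p = 1.0 / (1.0 + np.exp(-(self.W2 @ h)))         # per-neuron probability
        return np.flatnonzero(p > threshold)             # indices expected to fire

predictor = ToyActivationPredictor(d=4096, n=11008)
active = predictor.predict_active(np.random.default_rng(3).standard_normal(4096))
print(f"predicted active neurons: {len(active)} / 11008")
```

The cost of such a predictor is two skinny matrix-vector products, tiny compared with the FFN layer it guards, which is what makes per-layer, per-token prediction affordable.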
Memory Optimization Strategies
- Tiered Storage: Hot data is stored on the GPU, and cold data is stored on the CPU.
- Prefetching Mechanism: Intelligently prefetches data that is likely to be needed soon (sketched below).
- Compression Technology: Compressing cold data for storage.
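A rough sketch of the prefetching idea under these assumptions: before a layer runs, that layer's activation predictor tells a tiered store which cold-neuron weights will be needed so they can be staged ahead of the computation. The cache structure and names are illustrative, not PowerInfer's memory manager.

```python
class TieredWeightStore:
    """Toy tiered store: hot weights are 'GPU-resident'; cold weights live in
    the 'CPU tier' and are staged just before they are needed."""

    def __init__(self, hot_weights, cold_weights):
        self.gpu = dict(hot_weights)    # neuron id -> weight row (always resident)
        self.cpu = dict(cold_weights)   # neuron id -> weight row (fetched on demand)
        self.staged = {}

    def prefetch(self, predicted_active):
        # Stage the cold rows the predictor expects the next layer to use.
        self.staged = {i: self.cpu[i] for i in predicted_active if i in self.cpu}

    def get(self, neuron_id):
        if neuron_id in self.gpu:
            return self.gpu[neuron_id]
        if neuron_id in self.staged:
            return self.staged[neuron_id]
        return self.cpu[neuron_id]      # slow path: was not prefetched

store = TieredWeightStore(hot_weights={0: [0.1]}, cold_weights={5: [0.2], 9: [0.3]})
store.prefetch(predicted_active=[5])
row = store.get(5)   # served from the staged tier, not the slow path
```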
Development and Contribution
Open Source License
PowerInfer uses an open-source license and welcomes community contributions. The project actively accepts issue feedback and feature suggestions.
Development Team
- IPADS Lab, Shanghai Jiao Tong University: Main development team.
- THUNLP, Tsinghua University: ReLU sparse model support.
- Open Source Community: Continuous contributions and improvements.
Academic Impact
Related research papers have been published, providing an important theoretical foundation and practical guidance for the field of large language model inference optimization.
Summary
PowerInfer represents a significant breakthrough in local inference technology for large language models. Through its innovative hot-cold neuron mechanism and CPU/GPU hybrid architecture, it successfully achieves near server-level inference performance on consumer-grade hardware.