
PowerInfer is a high-speed large language model inference engine designed for local deployment, leveraging sparse activation and a CPU/GPU hybrid architecture to achieve fast LLM inference on consumer-grade hardware.

MIT License · C++ · 8.2k stars · SJTU-IPADS · Last Updated: 2025-02-19
https://github.com/SJTU-IPADS/PowerInfer

PowerInfer - High-Speed Large Language Model Inference Engine

Project Overview

PowerInfer is a high-speed Large Language Model (LLM) inference engine developed by the IPADS Lab at Shanghai Jiao Tong University, designed specifically for personal computers equipped with a single consumer-grade GPU. The core innovation of the project is to exploit the high locality inherent in LLM inference: neuron activations follow a power-law distribution, and PowerInfer uses this pattern to optimize inference performance.

Project Background

Traditional LLM inference faces significant computational and memory challenges, especially when deploying large models on consumer-grade hardware. PowerInfer addresses this by deeply analyzing neural network activation patterns, revealing a key insight: a small number of "hot" neurons are consistently activated across all inputs, while the majority of "cold" neurons vary depending on the specific input.

Core Technical Principles

Hot-Cold Neuron Mechanism

PowerInfer's design is based on the following core observations:

  • Hot Neurons: A small number of neurons that are consistently activated across all inputs.
  • Cold Neurons: The majority of neurons that change based on specific inputs.
  • Power-Law Distribution: Neuron activation follows a power-law distribution pattern.
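
A minimal numpy sketch of what a power-law activation profile implies (the Zipf-like exponent of 1.1 and the counts below are hypothetical, not measured statistics): sorting neurons by how often they fire shows that a small "hot" subset covers most activations.

# Toy illustration (not PowerInfer code): neuron activation counts drawn from a
# power-law profile, then split into hot and cold sets by coverage.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 10_000

counts = 1.0 / np.arange(1, n_neurons + 1) ** 1.1   # hypothetical power-law activation counts
counts = rng.permutation(counts)                     # shuffle so hot neurons are scattered

order = np.argsort(counts)[::-1]                     # neurons sorted by activation frequency
coverage = np.cumsum(counts[order]) / counts.sum()

n_hot = int(np.searchsorted(coverage, 0.80)) + 1     # smallest prefix covering ~80% of activations
hot, cold = order[:n_hot], order[n_hot:]

print(f"hot neurons: {n_hot} ({n_hot / n_neurons:.1%}) cover ~80% of all activations")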

GPU-CPU Hybrid Architecture

Based on the characteristics of hot and cold neurons, PowerInfer employs an innovative hybrid inference strategy:

  • GPU: Pre-loads hot-activated neurons for fast access.
  • CPU: Computes cold-activated neurons, significantly reducing GPU memory requirements.
  • Intelligent Scheduling: Greatly reduces CPU-GPU data transfer overhead.
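
As a rough illustration of the placement this strategy implies, here is a minimal numpy sketch (not PowerInfer's actual code; the 0.9 threshold and the activation frequencies are made up) that splits one FFN weight matrix row-wise into a GPU-resident hot part and a CPU-resident cold part:

# Simplified placement sketch: rows of an up-projection matrix assigned to
# "GPU" or "CPU" partitions based on how often their neurons activate.
import numpy as np

hidden, ffn = 4096, 11008
W_up = np.random.randn(ffn, hidden).astype(np.float32)   # one FFN up-projection matrix
act_freq = np.random.rand(ffn)                            # hypothetical per-neuron activation frequencies

hot_mask = act_freq > 0.9          # hot rows: frequently activated neurons
W_gpu = W_up[hot_mask]             # would be preloaded into GPU VRAM
W_cpu = W_up[~hot_mask]            # stays in host memory, computed on the CPU on demand

print(f"GPU-resident rows: {hot_mask.sum()}, CPU-resident rows: {(~hot_mask).sum()}")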

Core Features

🚀 High-Performance Inference

  • Speed Performance: Achieves an average token generation speed of 13.20 tokens/s, with peaks up to 29.08 tokens/s.
  • Performance Comparison: Up to 11.69x performance improvement compared to llama.cpp.
  • Hardware Efficiency: Performance on an RTX 4090 is only 18% lower than a server-grade A100 GPU.

🧠 Intelligent Optimization Technologies

  • Adaptive Predictor: Dynamically optimizes neuron activation prediction.
  • Neuron-Aware Sparse Operators: Compute only the neurons that are predicted to be active, turning activation sparsity into real speedups.
  • Locality-Centric Design: Fully leverages sparse activation characteristics.
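
The adaptive predictor can be pictured as a small low-rank network that, given a layer's input, guesses which ReLU outputs will be nonzero. Below is a hedged sketch with random placeholder weights and an assumed rank of 128; PowerInfer ships trained predictor weights with each model, so this only shows the shape of the computation.

# Sketch of a low-rank activation predictor (weights here are random placeholders).
import numpy as np

hidden, ffn, rank = 4096, 11008, 128       # rank 128 is an assumed predictor size

x = np.random.randn(hidden).astype(np.float32)          # layer input for one token
A = np.random.randn(hidden, rank).astype(np.float32)    # predictor down-projection (placeholder)
B = np.random.randn(rank, ffn).astype(np.float32)       # predictor up-projection (placeholder)

scores = (x @ A) @ B                # cheap: O(hidden*rank + rank*ffn) vs. O(hidden*ffn)
predicted_active = scores > 0       # neurons expected to survive the ReLU

print(f"predicted active neurons: {predicted_active.sum()} / {ffn}")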

🔧 Ease of Use and Compatibility

  • Easy Integration: Compatible with popular ReLU sparse models.
  • Local Deployment: Deeply optimized for consumer-grade hardware.
  • Backward Compatibility: Supports most llama.cpp usage patterns.

Supported Models

Currently Supported Model Families

Model Family       Parameter Sizes   Features
Falcon Series      40B               ReLU activation function optimization
Llama2 Series      7B/13B/70B        Full series support
ProSparse Llama2   7B/13B            ~90% sparsity, performance close to the original
Bamboo Series      7B                Combines top-tier performance with high speed

Model Format

PowerInfer uses a specialized PowerInfer GGUF format, which includes:

  • LLM weights
  • Predictor weights
  • Activation statistics
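
To peek inside such a file, the gguf Python package that accompanies llama.cpp can enumerate tensors in a standard GGUF file; whether it fully parses PowerInfer's extended variant is not guaranteed, so treat the snippet below (with an example file name) as exploratory.

# Exploratory sketch: list tensors in a GGUF file with the `gguf` package
# (pip install gguf). PowerInfer's extended format may not be fully readable
# by the stock reader; the file name below is only an example.
from gguf import GGUFReader

reader = GGUFReader("llama-7b-relu.powerinfer.gguf")
for tensor in reader.tensors[:10]:          # first few tensors: name and shape
    print(tensor.name, list(tensor.shape))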

Technical Architecture

System Design

┌──────────────────────┐    ┌──────────────────────┐
│      Hot Neurons     │───▶│          GPU         │
│    (Always Active)   │    │     (Fast Access)    │
└──────────────────────┘    └──────────────────────┘
                                       │
                                       ▼
┌──────────────────────┐    ┌──────────────────────┐
│     Cold Neurons     │───▶│          CPU         │
│(Conditionally Active)│    │(Flexible Computation)│
└──────────────────────┘    └──────────────────────┘

Core Components

  1. Activation Predictor: Intelligently predicts neuron activation patterns.
  2. Memory Manager: Optimizes GPU/CPU memory allocation.
  3. Sparse Operators: Efficiently handles sparse computations.
  4. Scheduler: Intelligently allocates computational tasks.
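
A simplified numpy stand-in for the sparse operators (component 3): only the rows the predictor marks as active are gathered and multiplied, and the down-projection skips columns whose activation is zero. The 0.8 threshold mask here is a random stand-in for real predictor output, not PowerInfer's kernels.

# Simplified neuron-aware sparse FFN step.
import numpy as np

hidden, ffn = 4096, 11008
x = np.random.randn(hidden).astype(np.float32)
W_up = np.random.randn(ffn, hidden).astype(np.float32)
W_down = np.random.randn(hidden, ffn).astype(np.float32)

active = np.random.rand(ffn) > 0.8               # random stand-in for the predictor's mask

h = np.zeros(ffn, dtype=np.float32)
h[active] = np.maximum(W_up[active] @ x, 0.0)    # ReLU, computed over active rows only
y = W_down[:, active] @ h[active]                # skip columns whose activation is zero

print(f"computed {active.sum()} of {ffn} neurons; output shape {y.shape}")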

Platform Support

Tested Platforms

  • Linux: x86-64 CPU with AVX2 instruction set, supports NVIDIA GPU.
  • Windows: x86-64 CPU with AVX2 instruction set, supports NVIDIA GPU.
  • macOS: Apple M-series chips (CPU only, limited performance improvement).
  • AMD GPU: Supported via ROCm.

Hardware Requirements

  • CPU: x86-64 processor supporting the AVX2 instruction set.
  • GPU: NVIDIA RTX series or AMD GPU (optional).
  • Memory: Depends on the model size.
  • Storage: Sufficient space to store model files.
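
On Linux, the AVX2 requirement can be verified before building; the snippet below is a Linux-only sketch that simply inspects /proc/cpuinfo.

# Linux-only AVX2 check by inspecting /proc/cpuinfo.
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()
print("AVX2 supported" if "avx2" in cpu_flags else "AVX2 not found")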

Performance Benchmarks

RTX 4090 Performance

Model        PowerInfer      llama.cpp      Speedup
Falcon-40B   11.2 tokens/s   1.0 tokens/s   11.2x
Llama2-70B   8.1 tokens/s    2.7 tokens/s   3.0x
Llama2-13B   24.8 tokens/s   8.9 tokens/s   2.8x

RTX 2080Ti Performance (INT4 Quantization)

Model        PowerInfer      llama.cpp      Speedup
Falcon-40B   6.8 tokens/s    0.85 tokens/s  8.0x
Llama2-70B   5.2 tokens/s    1.7 tokens/s   3.1x

Installation and Usage

Environment Requirements

  • CMake (3.17+)
  • Python (3.8+) and pip (19.3+)
  • CUDA toolchain (if using NVIDIA GPU)

Basic Installation

git clone https://github.com/SJTU-IPADS/PowerInfer
cd PowerInfer
pip install -r requirements.txt

# NVIDIA GPU
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release

# CPU only
cmake -S . -B build
cmake --build build --config Release

Model Download

# Download model using huggingface-cli
huggingface-cli download --resume-download --local-dir ReluLLaMA-7B \
  --local-dir-use-symlinks False PowerInfer/ReluLLaMA-7B-PowerInfer-GGUF

Running Inference

# Basic inference
./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
  -n 128 -t 8 -p "Once upon a time"

# Limit VRAM usage
./build/bin/main -m ./ReluLLaMA-7B-PowerInfer-GGUF/llama-7b-relu.powerinfer.gguf \
  -n 128 -t 8 -p "Once upon a time" --vram-budget 8

Latest Updates and Developments

Technical Innovations

  1. PowerInfer-2: Mobile-optimized version, achieving 11.68 tokens/s on smartphones.
  2. TurboSparse: Low-cost sparsification technique that significantly reduces the number of activated parameters while maintaining model performance.
  3. Bamboo LLM: A self-developed model series that balances performance and speed.

Application Scenarios

Suitable Scenarios

  • Personal AI Assistant: Deploying a private AI assistant locally.
  • Enterprise Internal Applications: Internal AI services that protect data privacy.
  • Research and Development: Rapid prototyping and model testing.
  • Edge Computing: Deploying LLMs in resource-constrained environments.
  • Educational Research: Learning and researching large model inference techniques.

Advantages and Features

  • Privacy Protection: All computations are performed locally.
  • Cost-Effectiveness: Excellent performance can be achieved using consumer-grade hardware.
  • Simple Deployment: No complex distributed system configuration required.
  • Fast Response: Low-latency local inference.

Technical Comparison

vs. Traditional Inference Engines

Feature                  PowerInfer                    Traditional Engines
Hardware Requirements    Consumer-grade GPU            Server-grade GPU
Memory Strategy          Hybrid CPU/GPU offloading     Full GPU loading
Inference Speed          Up to 11.69x speedup          Baseline
Cost                     Low                           High

vs. llama.cpp

  • Performance: Up to 11.69x speed improvement.
  • Memory: More efficient memory utilization.
  • Hardware: Better CPU/GPU coordination.
  • Compatibility: Supports most llama.cpp features.

Technical Principles in Depth

Sparsity Utilization

The core of PowerInfer lies in the deep utilization of neural network sparsity:

  1. Activation Pattern Analysis: Discovering the power-law distribution of neuron activation through extensive data analysis.
  2. Prediction Mechanism: Using a lightweight predictor to predict neuron activation states.
  3. Dynamic Scheduling: Dynamically allocating computing resources based on prediction results.

Memory Optimization Strategies

  • Tiered Storage: Hot data is stored on the GPU, and cold data is stored on the CPU.
  • Prefetching Mechanism: Intelligently prefetching potentially needed data.
  • Compression Technology: Compressing cold data for storage.
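
One way to picture the tiered-storage decision is a budgeted placement pass: rank neurons by activation frequency and assign them to VRAM until a per-layer budget is exhausted. The sketch below is a greedy toy (the 50 MiB budget and the statistics are made up); PowerInfer's actual placement, computed offline, is more involved than this.

# Greedy VRAM-budget placement sketch (illustrative only).
import numpy as np

n_neurons = 11008
bytes_per_neuron = 4096 * 2              # e.g. one fp16 up-projection row
layer_budget = 50 * 1024**2              # hypothetical per-layer VRAM budget (50 MiB)

act_freq = np.random.rand(n_neurons)     # stand-in activation statistics
order = np.argsort(act_freq)[::-1]       # hottest neurons first

n_gpu = min(n_neurons, layer_budget // bytes_per_neuron)
gpu_tier, cpu_tier = order[:n_gpu], order[n_gpu:]

print(f"GPU tier: {len(gpu_tier)} neurons, CPU tier: {len(cpu_tier)} neurons")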

Development and Contribution

Open Source License

PowerInfer is released under the MIT License and welcomes community contributions. The project actively accepts issue reports and feature suggestions.

Development Team

  • IPADS Lab, Shanghai Jiao Tong University: Main development team.
  • THUNLP, Tsinghua University: ReLU sparse model support.
  • Open Source Community: Continuous contributions and improvements.

Academic Impact

The accompanying research paper, "PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU," has been published, providing an important theoretical foundation and practical guidance for large language model inference optimization.

Summary

PowerInfer represents a significant breakthrough in local inference technology for large language models. Through its innovative hot-cold neuron mechanism and CPU/GPU hybrid architecture, it successfully achieves near server-level inference performance on consumer-grade hardware.