A sparsity-aware deep learning inference runtime designed specifically for CPUs.
DeepSparse - Sparsity-Aware Deep Learning Inference Engine Designed for CPUs
Project Overview
DeepSparse, developed by Neural Magic, is a CPU inference runtime that exploits neural network sparsity to accelerate deep learning inference. Used together with the SparseML optimization library, it delivers exceptional inference performance on commodity CPU hardware.
Important Update: In January 2025, Neural Magic was acquired by Red Hat; the DeepSparse community edition is deprecated and will reach end of life on June 2, 2025. The team is transitioning to commercial and open-source solutions built on vLLM.
Core Features
1. Sparsity Optimization
- Sparse Kernel Support: Achieves acceleration and memory savings through unstructured sparse weights (illustrated in the sketch after this list).
- 8-bit Quantization: Supports 8-bit quantization for weights and activations.
- Cache Optimization: Efficiently utilizes cached attention key-value pairs, minimizing memory movement.
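The sketch below uses plain NumPy (no DeepSparse APIs) to illustrate what unstructured sparsity means in practice: most individual weights are exactly zero, so a sparsity-aware kernel can skip them and store far less data. The matrix size and 90% pruning ratio are illustrative only.
import numpy as np

# Illustrative dense weight matrix (size chosen arbitrarily)
weights = np.random.randn(512, 512).astype(np.float32)

# Unstructured magnitude pruning: zero out the 90% of weights with the smallest magnitude
threshold = np.quantile(np.abs(weights), 0.90)
sparse_weights = np.where(np.abs(weights) < threshold, 0.0, weights)

sparsity = float((sparse_weights == 0).mean())
print(f"sparsity: {sparsity:.1%}")  # roughly 90% of entries are zero and can be skipped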
2. Large Language Model (LLM) Support
DeepSparse provides initial support for large language model inference, including:
- Sparse fine-tuning techniques for MPT-7B models.
- 7x speedup over dense baselines with sparse-quantized models.
- Support for models with up to 60% sparsity without loss of accuracy.
3. Broad Model Support
- Computer Vision: ResNet, EfficientNet, YOLOv5/8, ViT, etc.
- Natural Language Processing: BERT, Transformer variants, etc.
- Multimodal Models: Supports various CNN and Transformer architectures.
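These model families are served through the same Pipeline interface shown later in this document. As an illustrative sketch, an image-classification pipeline might look like the following; the task name, the SparseZoo stub, and the sample.jpg path are assumptions to verify against SparseZoo and the pipeline documentation.
from deepsparse import Pipeline

# Hypothetical computer-vision pipeline; the SparseZoo stub below is an assumption
cv_pipeline = Pipeline.create(
    task="image_classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none",
)
prediction = cv_pipeline(images=["sample.jpg"])
print(prediction)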
System Requirements
Hardware Support
- x86 Architecture: AVX2, AVX-512, AVX-512 VNNI
- ARM Architecture: v8.2+
Software Environment
- Operating System: Linux
- Python Version: 3.8-3.11
- ONNX Support: Versions 1.5.0-1.15.0, opset version 11 or higher
Note: macOS and Windows users are advised to run DeepSparse in a Docker Linux container.
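On Linux, a quick way to confirm which of the instruction-set extensions listed above the CPU exposes is to read /proc/cpuinfo. The sketch below is a generic check, not a DeepSparse utility; the flag names follow the usual Linux conventions.
# Generic Linux check for the CPU features listed above (not a DeepSparse API)
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for feature in ("avx2", "avx512f", "avx512_vnni"):
    print(feature, "supported" if feature in flags else "not found")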
Installation
Stable Version
pip install deepsparse
Nightly Build (Includes Latest Features)
pip install deepsparse-nightly
LLM Support Version
pip install -U deepsparse-nightly[llm]
Install from Source
pip install -e path/to/deepsparse
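Whichever installation route is used, a quick way to confirm the package imports correctly is to print its version (assuming the package exposes __version__ in the usual way):
import deepsparse

# Confirm the engine imports and report the installed version
print(deepsparse.__version__)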
Three Deployment APIs
1. Engine API (Low-Level API)
The lowest-level API: it compiles an ONNX model directly and operates on raw tensor inputs and outputs.
from deepsparse import Engine
# Download and compile the model
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)
# Run inference
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
2. Pipeline API (Mid-Level API)
Wraps the Engine and adds preprocessing and postprocessing functionality, allowing direct processing of raw data.
from deepsparse import Pipeline
# Set up the pipeline
sentiment_analysis_pipeline = Pipeline.create(
task="sentiment-analysis",
model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
)
# Run inference
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# Output: labels=['positive'] scores=[0.9954759478569031]
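Continuing the example above, the pipeline can also be called with a list of sequences in a single invocation; treat this as an assumption to verify against the pipeline documentation for your DeepSparse version.
# Passing several sequences at once (assumed behavior; verify with your DeepSparse version)
predictions = sentiment_analysis_pipeline(
    ["I love using DeepSparse Pipelines", "The weather today is miserable"]
)
print(predictions)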
3. Server API (High-Level API)
Wraps a Pipeline in a FastAPI application, exposing it as a REST service.
# Start the server
deepsparse.server \
--task sentiment-analysis \
--model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
# Send a request
import requests
url = "http://localhost:5543/v2/models/sentiment_analysis/infer"
obj = {"sequences": "Snorlax loves my Tesla!"}
response = requests.post(url, json=obj)
print(response.text)
# Output: {"labels":["positive"],"scores":[0.9965094327926636]}
Large Language Model Example
from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: what is sparsity?
### Response:
"""
result = pipeline(prompt, max_new_tokens=75)
print(result.generations[0].text)
Technical Advantages
1. Sparse Fine-Tuning Technology
- Innovative technology developed in collaboration with IST Austria.
- Prunes MPT-7B to 60% sparsity during fine-tuning.
- Achieves significant acceleration without loss of accuracy.
2. Performance Optimization
- Delivers GPU-class inference performance on CPUs for sparse-quantized models.
- Significantly reduces memory usage.
- Supports highly optimized sparse-quantized models.
3. Ecosystem Integration
- Seamless integration with the SparseML optimization library.
- SparseZoo model library provides pre-optimized models.
- Supports various deployment scenarios.
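Because the engine consumes standard ONNX, any model that can be exported to ONNX (ideally one optimized with SparseML) can be compiled and run. The sketch below uses a placeholder PyTorch module and torch.onnx.export; the layer sizes and file name are illustrative only.
import torch
from deepsparse import Engine

# Placeholder model; in practice this would be a SparseML-optimized network
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
model.eval()

# Export to ONNX, then compile the exported file with DeepSparse
torch.onnx.export(model, torch.randn(1, 128), "model.onnx", input_names=["input"], output_names=["logits"])
engine = Engine(model="model.onnx", batch_size=1)
print(engine(engine.generate_random_inputs()))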
Use Cases
- Edge Computing: Deploy high-performance AI models in resource-constrained environments.
- Cloud Inference: Reduce cloud computing costs and improve inference efficiency.
- Real-time Applications: Real-time AI applications requiring low latency.
- Large-Scale Deployment: Production environments that need to handle high-concurrency inference requests.
Privacy and Analytics
DeepSparse collects basic usage telemetry for product analytics. This can be disabled by setting an environment variable:
export NM_DISABLE_ANALYTICS=True
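The same variable can be set from Python; setting it before deepsparse is imported is an assumption about ordering, so verify against the documentation.
import os

# Disable usage telemetry (set before importing deepsparse; ordering is an assumption)
os.environ["NM_DISABLE_ANALYTICS"] = "True"

import deepsparse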
Academic Citations
This project is based on several important academic papers, including:
- Sparse Fine-Tuning for Inference Acceleration of Large Language Models (2023)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models (2022)
- Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks (ICML 2020)
Summary
DeepSparse represents a significant advance in CPU inference optimization. By exploiting sparsity, it brings strong deep learning inference performance to ordinary CPU hardware. Although the community edition is approaching the end of its maintenance, its techniques and ideas will continue to evolve under Red Hat, contributing to the broader field of AI inference optimization.