A sparsity-aware deep learning inference runtime designed specifically for CPUs.
DeepSparse - Sparsity-Aware Deep Learning Inference Engine Designed for CPUs
Project Overview
DeepSparse, developed by Neural Magic, is a CPU inference runtime that exploits neural network sparsity to accelerate deep learning inference. Used together with the SparseML optimization library, it delivers exceptional inference performance on commodity CPU hardware.
Important Update: In January 2025, Neural Magic was acquired by Red Hat; the DeepSparse community edition is deprecated and will reach end of life on June 2, 2025. The team is transitioning to commercial and open-source solutions built on vLLM.
Core Features
1. Sparsity Optimization
- Sparse Kernel Support: Achieves acceleration and memory savings through unstructured sparse weights (illustrated in the sketch after this list).
- 8-bit Quantization: Supports 8-bit quantization for weights and activations.
- Cache Optimization: Efficiently utilizes cached attention key-value pairs, minimizing memory movement.
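The sketch below uses plain NumPy (no DeepSparse APIs) to illustrate what unstructured sparsity means in practice: most individual weights are exactly zero, so a sparsity-aware kernel can skip them and store far less data. The matrix size and 90% pruning ratio are illustrative only.
import numpy as np

# Illustrative dense weight matrix (size chosen arbitrarily)
weights = np.random.randn(512, 512).astype(np.float32)

# Unstructured magnitude pruning: zero out the 90% of weights with the smallest magnitude
threshold = np.quantile(np.abs(weights), 0.90)
sparse_weights = np.where(np.abs(weights) < threshold, 0.0, weights)

sparsity = float((sparse_weights == 0).mean())
print(f"sparsity: {sparsity:.1%}")  # roughly 90% of entries are zero and can be skipped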
2. Large Language Model (LLM) Support
DeepSparse provides initial support for large language model inference, including:
- Sparse fine-tuning techniques for MPT-7B models.
- 7x speedup over dense baselines with sparse-quantized models.
- Support for models with up to 60% sparsity without loss of accuracy.
3. Broad Model Support
- Computer Vision: ResNet, EfficientNet, YOLOv5/8, ViT, etc.
- Natural Language Processing: BERT, Transformer variants, etc.
- Multimodal Models: Supports various CNN and Transformer architectures.
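These model families are served through the same Pipeline interface shown later in this document. As an illustrative sketch, an image-classification pipeline might look like the following; the task name, the SparseZoo stub, and the sample.jpg path are assumptions to verify against SparseZoo and the pipeline documentation.
from deepsparse import Pipeline

# Hypothetical computer-vision pipeline; the SparseZoo stub below is an assumption
cv_pipeline = Pipeline.create(
    task="image_classification",
    model_path="zoo:cv/classification/resnet_v1-50/pytorch/sparseml/imagenet/pruned95_quant-none",
)
prediction = cv_pipeline(images=["sample.jpg"])
print(prediction)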
System Requirements
Hardware Support
- x86 Architecture: AVX2, AVX-512, AVX-512 VNNI
- ARM Architecture: v8.2+
Software Environment
- Operating System: Linux
- Python Version: 3.8-3.11
- ONNX Support: Versions 1.5.0-1.15.0, opset version 11 or higher
Note: macOS and Windows users are advised to run DeepSparse in a Docker Linux container.
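On Linux, a quick way to confirm which of the instruction-set extensions listed above the CPU exposes is to read /proc/cpuinfo. The sketch below is a generic check, not a DeepSparse utility; the flag names follow the usual Linux conventions.
# Generic Linux check for the CPU features listed above (not a DeepSparse API)
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break

for feature in ("avx2", "avx512f", "avx512_vnni"):
    print(feature, "supported" if feature in flags else "not found")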
Installation
Stable Version
pip install deepsparse
Nightly Build (Includes Latest Features)
pip install deepsparse-nightly
LLM Support Version
pip install -U deepsparse-nightly[llm]
Install from Source
pip install -e path/to/deepsparse
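Whichever installation route is used, a quick way to confirm the package imports correctly is to print its version (assuming the package exposes __version__ in the usual way):
import deepsparse

# Confirm the engine imports and report the installed version
print(deepsparse.__version__)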
Three Deployment APIs
1. Engine API (Low-Level API)
The lowest-level API: it compiles an ONNX model directly and operates on raw tensor inputs and outputs.
from deepsparse import Engine
# Download and compile the model
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)
# Run inference
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
2. Pipeline API (Mid-Level API)
Wraps the Engine and adds preprocessing and postprocessing functionality, allowing direct processing of raw data.
from deepsparse import Pipeline
# Set up the pipeline
sentiment_analysis_pipeline = Pipeline.create(
task="sentiment-analysis",
model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
)
# Run inference
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# Output: labels=['positive'] scores=[0.9954759478569031]
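Continuing the example above, the pipeline can also be called with a list of sequences in a single invocation; treat this as an assumption to verify against the pipeline documentation for your DeepSparse version.
# Passing several sequences at once (assumed behavior; verify with your DeepSparse version)
predictions = sentiment_analysis_pipeline(
    ["I love using DeepSparse Pipelines", "The weather today is miserable"]
)
print(predictions)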
3. Server API (High-Level API)
Wraps a Pipeline in a FastAPI application, exposing it as a REST service.
# Start the server
deepsparse.server \
--task sentiment-analysis \
--model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
# Send a request
import requests
url = "http://localhost:5543/v2/models/sentiment_analysis/infer"
obj = {"sequences": "Snorlax loves my Tesla!"}
response = requests.post(url, json=obj)
print(response.text)
# Output: {"labels":["positive"],"scores":[0.9965094327926636]}
Large Language Model Example
from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: what is sparsity?
### Response:
"""
result = pipeline(prompt, max_new_tokens=75)
print(result.generations[0].text)
Technical Advantages
1. Sparse Fine-Tuning Technology
- Innovative technology developed in collaboration with IST Austria.
- Prunes MPT-7B to 60% sparsity during fine-tuning.
- Achieves significant acceleration without loss of accuracy.
2. Performance Optimization
- Delivers GPU-class inference performance on CPUs for sparse-quantized models.
- Significantly reduces memory usage.
- Supports highly optimized sparse-quantized models.
3. Ecosystem Integration
- Seamless integration with the SparseML optimization library.
- SparseZoo model library provides pre-optimized models.
- Supports various deployment scenarios.
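Because the engine consumes standard ONNX, any model that can be exported to ONNX (ideally one optimized with SparseML) can be compiled and run. The sketch below uses a placeholder PyTorch module and torch.onnx.export; the layer sizes and file name are illustrative only.
import torch
from deepsparse import Engine

# Placeholder model; in practice this would be a SparseML-optimized network
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
model.eval()

# Export to ONNX, then compile the exported file with DeepSparse
torch.onnx.export(model, torch.randn(1, 128), "model.onnx", input_names=["input"], output_names=["logits"])
engine = Engine(model="model.onnx", batch_size=1)
print(engine(engine.generate_random_inputs()))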
Use Cases
- Edge Computing: Deploy high-performance AI models in resource-constrained environments.
- Cloud Inference: Reduce cloud computing costs and improve inference efficiency.
- Real-time Applications: Real-time AI applications requiring low latency.
- Large-Scale Deployment: Production environments that need to handle high-concurrency inference requests.
Privacy and Analytics
DeepSparse collects basic usage telemetry for product analytics. This can be disabled by setting an environment variable:
export NM_DISABLE_ANALYTICS=True
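The same variable can be set from Python; setting it before deepsparse is imported is an assumption about ordering, so verify against the documentation.
import os

# Disable usage telemetry (set before importing deepsparse; ordering is an assumption)
os.environ["NM_DISABLE_ANALYTICS"] = "True"

import deepsparse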
Academic Citations
This project is based on several important academic papers, including:
- Sparse Fine-Tuning for Inference Acceleration of Large Language Models (2023)
- The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models (2022)
- Inducing and Exploiting Activation Sparsity for Fast Inference on Deep Neural Networks (ICML 2020)
Summary
DeepSparse represents a significant advance in CPU inference optimization. By exploiting sparsity, it brings strong deep learning inference performance to ordinary CPU hardware. Although the community edition is approaching the end of its maintenance, its techniques and ideas will continue to evolve under Red Hat, contributing to the broader field of AI inference optimization.