DeepSparse, developed by Neural Magic, is a CPU inference runtime that exploits the sparsity of neural networks to accelerate deep learning inference. Paired with the SparseML optimization library, which produces the sparsified models it consumes, it delivers strong inference performance on commodity CPU hardware.
Important update: In January 2025, Neural Magic was acquired by Red Hat, and the DeepSparse community version will be deprecated on June 2, 2025. The team is transitioning to commercial and open-source solutions built around vLLM.
DeepSparse provides initial support for large language model inference, including sparse kernels for speedups and memory savings from unstructured sparse weights, 8-bit weight and activation quantization, and efficient usage of cached attention keys and values for minimal memory movement.
Note: DeepSparse runs on Linux; Mac and Windows users are advised to run it inside a Docker Linux container.
pip install deepsparse                  # latest stable release
pip install deepsparse-nightly          # nightly build with the newest features
pip install -U deepsparse-nightly[llm]  # nightly build with LLM support
pip install -e path/to/deepsparse       # editable install from a local clone
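After installation, a quick import check confirms the runtime is available (a minimal sketch; __version__ is the standard package attribute):
import deepsparse
# Print the installed version to verify the install succeeded
print(deepsparse.__version__)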
Engine is the lowest-level API: it compiles an ONNX model directly and operates on raw tensor inputs and outputs.
from deepsparse import Engine
# Download and compile the model
zoo_stub = "zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
compiled_model = Engine(model=zoo_stub, batch_size=1)
# Run inference
inputs = compiled_model.generate_random_inputs()
output = compiled_model(inputs)
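To get a rough sense of raw engine latency, one can time repeated calls to the compiled model. This is a minimal sketch continuing the example above; for rigorous measurement, DeepSparse also ships a dedicated deepsparse.benchmark CLI:
import time
# Time 100 forward passes of the compiled engine
start = time.perf_counter()
for _ in range(100):
    compiled_model(inputs)
elapsed = time.perf_counter() - start
print(f"average latency: {elapsed / 100 * 1000:.2f} ms")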
Pipeline wraps the Engine and adds preprocessing and postprocessing, so it accepts raw inputs (such as text) and returns structured predictions.
from deepsparse import Pipeline
# Set up the pipeline
sentiment_analysis_pipeline = Pipeline.create(
task="sentiment-analysis",
model_path="zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none"
)
# Run inference
prediction = sentiment_analysis_pipeline("I love using DeepSparse Pipelines")
print(prediction)
# Output: labels=['positive'] scores=[0.9954759478569031]
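Pipelines also accept a batch of inputs in a single call (a sketch assuming the list form of the same sequences input; the output schema matches the single-input case above):
batch = [
    "I love using DeepSparse Pipelines",
    "The documentation could be better",
]
predictions = sentiment_analysis_pipeline(batch)
# One label and score per input sequence
print(predictions.labels, predictions.scores)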
Server wraps the Pipeline with FastAPI, exposing model inference as a REST API.
# Start the server
deepsparse.server \
--task sentiment-analysis \
--model_path zoo:nlp/sentiment_analysis/obert-base/pytorch/huggingface/sst2/pruned90_quant-none
# Send a request
import requests
url = "http://localhost:5543/v2/models/sentiment_analysis/infer"
obj = {"sequences": "Snorlax loves my Tesla!"}
response = requests.post(url, json=obj)
print(response.text)
# Output: {"labels":["positive"],"scores":[0.9965094327926636]}
TextGeneration provides a text-in, text-out interface for LLM inference:
from deepsparse import TextGeneration
pipeline = TextGeneration(model="zoo:mpt-7b-dolly_mpt_pretrain-pruned50_quantized")
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: what is sparsity?
### Response:
"""
result = pipeline(prompt, max_new_tokens=75)
print(result.generations[0].text)
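Generation can also be tuned with sampling parameters (a hedged sketch: the parameter names below follow Hugging Face GenerationConfig conventions and are assumptions, not confirmed by the source):
# Sample with temperature instead of greedy decoding (assumed parameter names)
result = pipeline(prompt, max_new_tokens=75, do_sample=True, temperature=0.7)
print(result.generations[0].text)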
DeepSparse collects basic usage telemetry data for product usage analysis. Users can disable this by setting an environment variable:
export NM_DISABLE_ANALYTICS=True
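The flag can also be set from Python (a sketch assuming the variable is read at import time, so it must be set before deepsparse is imported):
import os
# Disable usage analytics before importing deepsparse
os.environ["NM_DISABLE_ANALYTICS"] = "True"
import deepsparse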
The project is grounded in Neural Magic's published research on sparsity-aware inference and model compression.
DeepSparse demonstrated that sparsity-aware execution can make commodity CPUs genuinely competitive for deep learning inference. Although the community version is nearing its end of maintenance, its techniques and ideas will continue to be developed under Red Hat, primarily within the vLLM ecosystem, contributing further to the field of AI inference optimization.