An efficient OCR model based on visual compression, capable of converting document images into Markdown format, supporting multi-resolution and multi-language recognition.
DeepSeek-OCR Project Details
Project Overview
DeepSeek-OCR is an innovative open-source optical character recognition model developed by the DeepSeek AI team, focusing on exploring the boundaries of visual text compression. This project investigates the role of visual encoders from a Large Language Model (LLM)-centric perspective, treating visual perception as an information compression medium. This approach enables the processing of large and complex documents with significantly fewer tokens.
Key Features
- Efficient Compression: Achieves a token compression rate of 7-20x, maintaining approximately 97% decoding accuracy at 10x compression.
- Multi-resolution Support: Supports various native resolutions from 512×512 to 1280×1280.
- High-performance Processing: A single A100-40G GPU can generate over 200,000 pages of training data per day.
- Multi-language Support: Supports text recognition for approximately 100 languages.
- Versatility: Not only supports text extraction but also understands charts, chemical formulas, and simple graphics.
Technical Architecture
Model Components
DeepSeek-OCR consists of two core components:
DeepEncoder (Visual Encoder)
- Parameter Count: Approximately 380 million
- Architectural Composition:
- SAM-ViTDet (Meta's 80-million-parameter segmentation model) for local image perception.
- 2-layer convolutional compressor, achieving 16x token downsampling.
- CLIP ViT-300M (OpenAI's 300-million-parameter model) for global visual knowledge aggregation.
DeepSeek3B-MoE Decoder
- Active Parameters: Approximately 570 million
- Total Parameters: Approximately 3 billion, using a Mixture-of-Experts (MoE) architecture
- Function: Generates results based on image tokens and prompt information.
How It Works
Image Processing Workflow:
- A 1024×1024 pixel image initially generates 4096 tokens.
- The SAM module performs window attention processing.
- The compressor reduces tokens to 256 (16x compression).
- The CLIP module performs global attention processing.
- Finally, compressed visual tokens are output (see the toy encoder sketch after this list).
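The flow above can be made concrete with a toy sketch. The module below only illustrates the token arithmetic and the local-attention → compressor → global-attention ordering; the layer sizes, the 16-pixel patch size, and the two stride-2 convolutions are assumptions for illustration, not the actual DeepSeek-OCR implementation.
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    """Illustrative stand-in for DeepEncoder: local attention -> 16x compressor -> global attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 1024x1024 image -> 64x64 = 4096 patch tokens
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)   # stand-in for SAM window attention
        # Two stride-2 convolutions: 4x fewer tokens per spatial side, i.e. 16x fewer tokens overall (4096 -> 256)
        self.compressor = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)  # stand-in for CLIP global attention

    def forward(self, x):                                    # x: (B, 3, 1024, 1024)
        feats = self.patch_embed(x)                          # (B, dim, 64, 64)
        b, d, h, w = feats.shape
        tokens = self.local_attn(feats.flatten(2).transpose(1, 2))    # (B, 4096, dim), local perception
        feats = tokens.transpose(1, 2).reshape(b, d, h, w)
        feats = self.compressor(feats)                                 # (B, dim, 16, 16)
        return self.global_attn(feats.flatten(2).transpose(1, 2))     # (B, 256, dim), global aggregation

print(ToyDeepEncoder()(torch.randn(1, 3, 1024, 1024)).shape)  # torch.Size([1, 256, 256])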
Resolution Modes:
Native Resolution Mode:
- Tiny: 512×512 (64 visual tokens)
- Small: 640×640 (100 visual tokens)
- Base: 1024×1024 (256 visual tokens)
- Large: 1280×1280 (400 visual tokens)
Dynamic Resolution Mode:
- Gundam: n×640×640 + 1×1024×1024 (combining global and local views)
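A quick back-of-the-envelope check of these token counts, assuming the 16-pixel patches and 16x compressor (4x per spatial side) described in the workflow above:
def visual_tokens(side: int, patch: int = 16, spatial_downsample: int = 4) -> int:
    # tokens per image = (side / patch / spatial_downsample) ** 2
    return (side // patch // spatial_downsample) ** 2

for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(name, visual_tokens(side))   # 64, 100, 256, 400
# Gundam combines n local 640x640 tiles with one global 1024x1024 view,
# which presumably yields n * 100 + 256 visual tokens.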
Performance Benchmarks
Benchmark Results
- Fox Benchmark: Achieves approximately 97% decoding accuracy at a 10x compression rate.
- OmniDocBench Benchmark:
- Outperforms GOT-OCR2.0 (256 tokens/page) using only 100 visual tokens.
- Outperforms MinerU2.0 (averaging over 6000 tokens/page) using fewer than 800 visual tokens.
Training and Inference Performance
- Training Speed:
- Pure text data: 90B tokens per day
- Multimodal data: 70B tokens per day
- Production Performance: A single A100-40G GPU can process over 200,000 pages per day.
- Concurrent Performance: PDF processing at approximately 2500 tokens/s (A100-40G).
Application Scenarios
Supported Prompt Modes
DeepSeek-OCR supports multiple prompt modes:
# Document to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
# General OCR
prompt = "<image>\n<|grounding|>OCR this image."
# Free OCR (no layout)
prompt = "<image>\nFree OCR."
# Figure Parsing
prompt = "<image>\nParse the figure."
# Detailed Image Description
prompt = "<image>\nDescribe this image in detail."
# Text Localization
prompt = "<image>\nLocate <|ref|>xxxx<|/ref|> in the image."
Practical Applications
- Document Digitization: Efficiently processes academic papers, books, reports, and other documents.
- Dataset Generation: Generates massive training data for Large Language Models and Vision-Language Models.
- Chatbot Context Compression: Stores old conversation records at lower resolution (similar to human memory decay); see the sketch after this list.
- Structured Data Extraction:
- Converts financial charts into structured data.
- Automatically generates Markdown tables and graphics.
- Supports chemical formula (SMILES format) recognition.
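A minimal sketch of the context-compression idea, assuming Pillow is available and that model and tokenizer are loaded as in the Transformers example further below; this illustrates the concept and is not an official feature.
from PIL import Image, ImageDraw

def archive_turns(turns, path="old_context.png", size=(1024, 1024)):
    # Render old conversation turns into a plain white image.
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((20, 20), "\n".join(turns), fill="black")
    img.save(path)
    return path

image_file = archive_turns(["user: hello", "assistant: hi, how can I help?"])
# Later, recover the archived context with very few visual tokens (Tiny mode, 64 tokens):
# model.infer(tokenizer, prompt="<image>\nFree OCR.", image_file=image_file,
#             base_size=512, image_size=512, crop_mode=False)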
Installation and Usage
Environment Requirements
- Python 3.12.9
- CUDA 11.8
- PyTorch 2.6.0
- Transformers 4.46.3
Installation Steps
# Clone the repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
# Create Conda environment
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# Install vLLM from the prebuilt wheel (download the wheel first; see the repository's instructions)
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
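An optional sanity check after installation (an illustrative snippet, not from the official docs): confirm that the CUDA build of PyTorch and flash-attn import correctly before running inference.
import torch
import flash_attn
print(torch.__version__, torch.cuda.is_available(), flash_attn.__version__)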
Usage Examples
Method 1: Using Transformers
from transformers import AutoModel, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_name,
_attn_implementation='flash_attention_2',
trust_remote_code=True,
use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)
# Configure inference parameters
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# Execute inference
res = model.infer(
tokenizer,
prompt=prompt,
image_file=image_file,
output_path=output_path,
base_size=1024,   # global view resolution
image_size=640,   # local tile resolution
crop_mode=True,   # dynamic-resolution (Gundam) mode: n x 640x640 tiles + one 1024x1024 global view
save_results=True,
test_compress=True
)
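The base_size, image_size, and crop_mode arguments select the resolution mode. The mapping below is inferred from the resolution modes listed earlier and should be treated as an assumption to verify against the repository:
# Inferred mapping from resolution mode to infer() arguments (assumption; verify against the repo).
MODES = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),   # 64 visual tokens
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),   # 100 visual tokens
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),   # 256 visual tokens
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),   # 400 visual tokens
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),    # global + local tiles
}

res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                  output_path=output_path, save_results=True, **MODES["gundam"])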
Method 2: Using vLLM (High-Performance Inference)
# Modify configuration file
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py to set INPUT_PATH/OUTPUT_PATH
# Run image OCR (streaming output)
python run_dpsk_ocr_image.py
# Run PDF OCR (high concurrency)
python run_dpsk_ocr_pdf.py
# Batch evaluation
python run_dpsk_ocr_eval_batch.py
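A minimal example of what the config.py settings might look like (the variable names come from the step above; the paths are hypothetical):
# config.py (illustrative values only)
INPUT_PATH = "/path/to/input"     # directory of images, or a PDF file, to process
OUTPUT_PATH = "/path/to/output"   # where the Markdown results are written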
Technical Innovations
Visual Text Compression Paradigm
DeepSeek-OCR proposes a new visual text compression paradigm:
- Core Idea: Render text as images and process them with a visual encoder, rather than storing the content as text tokens.
- Advantages:
- Lower memory footprint: Visual tokens are more compact.
- Faster inference speed: Fewer tokens = less computation.
- Natural forgetting mechanism: Old context can be downsampled.
- Easier multimodal fusion: The model already treats text as images.
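A rough worked example of the compression claim (the text-token figure is an assumption for illustration):
text_tokens = 2560          # hypothetical token count of a dense page's raw text
vision_tokens = 256         # Base mode (1024x1024)
print(f"compression: {text_tokens / vision_tokens:.0f}x")   # ~10x, at which the Fox benchmark reports ~97% decoding accuracy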
Distinction from Traditional OCR
Traditional OCR employs a pipeline architecture (detection → recognition → post-processing), whereas DeepSeek-OCR uses an end-to-end vision-language model architecture, fundamentally simplifying the OCR system.
Resource Links
- GitHub Repository: https://github.com/deepseek-ai/DeepSeek-OCR
- Hugging Face Model: https://huggingface.co/deepseek-ai/DeepSeek-OCR
- Technical Paper: DeepSeek_OCR_paper.pdf
- License: MIT License
Acknowledgements
The DeepSeek-OCR project thanks the following open-source projects for their contributions:
- Vary
- GOT-OCR2.0
- MinerU
- PaddleOCR
- OneChart
- Slow Perception
And benchmark datasets: Fox and OmniDocBench.
Conclusion
DeepSeek-OCR represents a significant innovation in OCR technology, addressing the challenges of long-context processing in Large Language Models through a visual compression paradigm. Its efficient token compression (7-20x), strong accuracy (approximately 97% at 10x compression), and high throughput (over 200,000 pages per day on a single GPU) make it an ideal choice for document digitization, AI training data generation, and multimodal applications.
The open-source nature of this project and its comprehensive documentation make it easy to integrate into various application scenarios, providing researchers and developers with a powerful tool to explore the boundaries of visual text compression.