An efficient OCR model based on visual compression, capable of converting document images into Markdown format, supporting multi-resolution and multi-language recognition.
DeepSeek-OCR Project Details
Project Overview
DeepSeek-OCR is an innovative open-source optical character recognition model developed by the DeepSeek AI team, focusing on exploring the boundaries of visual text compression. This project investigates the role of visual encoders from a Large Language Model (LLM)-centric perspective, treating visual perception as an information compression medium. This approach enables the processing of large and complex documents with significantly fewer tokens.
Key Features
- Efficient Compression: Achieves a token compression rate of 7-20x, maintaining approximately 97% decoding accuracy at 10x compression.
- Multi-resolution Support: Supports various native resolutions from 512×512 to 1280×1280.
- High-performance Processing: A single A100-40G GPU can generate over 200,000 pages of training data per day.
- Multi-language Support: Supports text recognition for approximately 100 languages.
- Versatility: Not only supports text extraction but also understands charts, chemical formulas, and simple graphics.
Technical Architecture
Model Components
DeepSeek-OCR consists of two core components:
DeepEncoder (Visual Encoder)
- Parameter Count: Approximately 380 million
- Architectural Composition:
- SAM-ViTDet (Meta's 80-million-parameter segmentation model) for local image perception.
- 2-layer convolutional compressor, achieving 16x token downsampling.
- CLIP ViT-300M (OpenAI's 300-million-parameter model) for global visual knowledge aggregation.
DeepSeek3B-MoE Decoder
- Active Parameters: Approximately 570 million
- Total Parameters: Approximately 3 billion, using a Mixture-of-Experts (MoE) architecture
- Function: Generates results based on image tokens and prompt information.
How It Works
Image Processing Workflow:
- A 1024×1024 pixel image initially generates 4096 tokens.
- The SAM module performs window attention processing.
- The compressor reduces tokens to 256 (16x compression).
- The CLIP module performs global attention processing.
- Finally, compressed visual tokens are output (see the toy encoder sketch after this list).
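The flow above can be made concrete with a toy sketch. The module below only illustrates the token arithmetic and the local-attention → compressor → global-attention ordering; the layer sizes, the 16-pixel patch size, and the two stride-2 convolutions are assumptions for illustration, not the actual DeepSeek-OCR implementation.
import torch
import torch.nn as nn

class ToyDeepEncoder(nn.Module):
    """Illustrative stand-in for DeepEncoder: local attention -> 16x compressor -> global attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 1024x1024 image -> 64x64 = 4096 patch tokens
        self.local_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)   # stand-in for SAM window attention
        # Two stride-2 convolutions: 4x fewer tokens per spatial side, i.e. 16x fewer tokens overall (4096 -> 256)
        self.compressor = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        self.global_attn = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)  # stand-in for CLIP global attention

    def forward(self, x):                                    # x: (B, 3, 1024, 1024)
        feats = self.patch_embed(x)                          # (B, dim, 64, 64)
        b, d, h, w = feats.shape
        tokens = self.local_attn(feats.flatten(2).transpose(1, 2))    # (B, 4096, dim), local perception
        feats = tokens.transpose(1, 2).reshape(b, d, h, w)
        feats = self.compressor(feats)                                 # (B, dim, 16, 16)
        return self.global_attn(feats.flatten(2).transpose(1, 2))     # (B, 256, dim), global aggregation

print(ToyDeepEncoder()(torch.randn(1, 3, 1024, 1024)).shape)  # torch.Size([1, 256, 256])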
Resolution Modes:
Native Resolution Mode:
- Tiny: 512×512 (64 visual tokens)
- Small: 640×640 (100 visual tokens)
- Base: 1024×1024 (256 visual tokens)
- Large: 1280×1280 (400 visual tokens)
Dynamic Resolution Mode:
- Gundam: n×640×640 + 1×1024×1024 (combining global and local views)
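A quick back-of-the-envelope check of these token counts, assuming the 16-pixel patches and 16x compressor (4x per spatial side) described in the workflow above:
def visual_tokens(side: int, patch: int = 16, spatial_downsample: int = 4) -> int:
    # tokens per image = (side / patch / spatial_downsample) ** 2
    return (side // patch // spatial_downsample) ** 2

for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(name, visual_tokens(side))   # 64, 100, 256, 400
# Gundam combines n local 640x640 tiles with one global 1024x1024 view,
# which presumably yields n * 100 + 256 visual tokens.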
Performance Benchmarks
Benchmark Results
- Fox Benchmark: Achieves approximately 97% decoding accuracy at a 10x compression rate.
- OmniDocBench Benchmark:
- Outperforms GOT-OCR2.0 (256 tokens/page) using only 100 visual tokens.
- Outperforms MinerU2.0 (averaging over 6000 tokens/page) using fewer than 800 visual tokens.
Training and Inference Performance
- Training Speed:
- Pure text data: 90B tokens per day
- Multimodal data: 70B tokens per day
- Production Performance: A single A100-40G GPU can process over 200,000 pages per day.
- Concurrent Performance: PDF processing at approximately 2500 tokens/s (A100-40G).
Application Scenarios
Supported Prompt Modes
DeepSeek-OCR supports multiple prompt modes:
# Document to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."
# General OCR
prompt = "<image>\n<|grounding|>OCR this image."
# Free OCR (no layout)
prompt = "<image>\nFree OCR."
# Figure Parsing
prompt = "<image>\nParse the figure."
# Detailed Image Description
prompt = "<image>\nDescribe this image in detail."
# Text Localization
prompt = "<image>\nLocate <|ref|>xxxx<|/ref|> in the image."
Practical Applications
- Document Digitization: Efficiently processes academic papers, books, reports, and other documents.
- Dataset Generation: Generates massive training data for Large Language Models and Vision-Language Models.
- Chatbot Context Compression: Stores old conversation records at lower resolution (similar to human memory decay); see the sketch after this list.
- Structured Data Extraction:
- Converts financial charts into structured data.
- Automatically generates Markdown tables and graphics.
- Supports chemical formula (SMILES format) recognition.
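A minimal sketch of the context-compression idea, assuming Pillow is available and that model and tokenizer are loaded as in the Transformers example further below; this illustrates the concept and is not an official feature.
from PIL import Image, ImageDraw

def archive_turns(turns, path="old_context.png", size=(1024, 1024)):
    # Render old conversation turns into a plain white image.
    img = Image.new("RGB", size, "white")
    ImageDraw.Draw(img).text((20, 20), "\n".join(turns), fill="black")
    img.save(path)
    return path

image_file = archive_turns(["user: hello", "assistant: hi, how can I help?"])
# Later, recover the archived context with very few visual tokens (Tiny mode, 64 tokens):
# model.infer(tokenizer, prompt="<image>\nFree OCR.", image_file=image_file,
#             base_size=512, image_size=512, crop_mode=False)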
Installation and Usage
Environment Requirements
- Python 3.12.9
- CUDA 11.8
- PyTorch 2.6.0
- Transformers 4.46.3
Installation Steps
# Clone the repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
# Create Conda environment
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
# Install vLLM from the prebuilt wheel (download the wheel first; see the repository's instructions)
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
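An optional sanity check after installation (an illustrative snippet, not from the official docs): confirm that the CUDA build of PyTorch and flash-attn import correctly before running inference.
import torch
import flash_attn
print(torch.__version__, torch.cuda.is_available(), flash_attn.__version__)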
Usage Examples
Method 1: Using Transformers
from transformers import AutoModel, AutoTokenizer
import torch
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
model_name,
_attn_implementation='flash_attention_2',
trust_remote_code=True,
use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)
# Configure inference parameters
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'
# Execute inference
res = model.infer(
tokenizer,
prompt=prompt,
image_file=image_file,
output_path=output_path,
base_size=1024,   # global view resolution
image_size=640,   # local tile resolution
crop_mode=True,   # dynamic-resolution (Gundam) mode: n x 640x640 tiles + one 1024x1024 global view
save_results=True,
test_compress=True
)
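The base_size, image_size, and crop_mode arguments select the resolution mode. The mapping below is inferred from the resolution modes listed earlier and should be treated as an assumption to verify against the repository:
# Inferred mapping from resolution mode to infer() arguments (assumption; verify against the repo).
MODES = {
    "tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),   # 64 visual tokens
    "small":  dict(base_size=640,  image_size=640,  crop_mode=False),   # 100 visual tokens
    "base":   dict(base_size=1024, image_size=1024, crop_mode=False),   # 256 visual tokens
    "large":  dict(base_size=1280, image_size=1280, crop_mode=False),   # 400 visual tokens
    "gundam": dict(base_size=1024, image_size=640,  crop_mode=True),    # global + local tiles
}

res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                  output_path=output_path, save_results=True, **MODES["gundam"])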
Method 2: Using vLLM (High-Performance Inference)
# Modify configuration file
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py to set INPUT_PATH/OUTPUT_PATH
# Run image OCR (streaming output)
python run_dpsk_ocr_image.py
# Run PDF OCR (high concurrency)
python run_dpsk_ocr_pdf.py
# Batch evaluation
python run_dpsk_ocr_eval_batch.py
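A minimal example of what the config.py settings might look like (the variable names come from the step above; the paths are hypothetical):
# config.py (illustrative values only)
INPUT_PATH = "/path/to/input"     # directory of images, or a PDF file, to process
OUTPUT_PATH = "/path/to/output"   # where the Markdown results are written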
Technical Innovations
Visual Text Compression Paradigm
DeepSeek-OCR proposes a new visual text compression paradigm:
- Core Idea: Render text as images and process them with a visual encoder, rather than storing the content as text tokens.
- Advantages:
- Lower memory footprint: Visual tokens are more compact.
- Faster inference speed: Fewer tokens = less computation.
- Natural forgetting mechanism: Old context can be downsampled.
- Easier multimodal fusion: The model already treats text as images.
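A rough worked example of the compression claim (the text-token figure is an assumption for illustration):
text_tokens = 2560          # hypothetical token count of a dense page's raw text
vision_tokens = 256         # Base mode (1024x1024)
print(f"compression: {text_tokens / vision_tokens:.0f}x")   # ~10x, at which the Fox benchmark reports ~97% decoding accuracy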
Distinction from Traditional OCR
Traditional OCR employs a pipeline architecture (detection → recognition → post-processing), whereas DeepSeek-OCR uses an end-to-end vision-language model architecture, fundamentally simplifying the OCR system.
Resource Links
- GitHub Repository: https://github.com/deepseek-ai/DeepSeek-OCR
- Hugging Face Model: https://huggingface.co/deepseek-ai/DeepSeek-OCR
- Technical Paper: DeepSeek_OCR_paper.pdf
- License: MIT License
Acknowledgements
The DeepSeek-OCR project thanks the following open-source projects for their contributions:
- Vary
- GOT-OCR2.0
- MinerU
- PaddleOCR
- OneChart
- Slow Perception
And benchmark datasets: Fox and OmniDocBench.
Conclusion
DeepSeek-OCR represents a significant innovation in OCR technology, addressing the challenges of long-context processing in Large Language Models through a visual compression paradigm. Its efficient token compression (7-20x), strong accuracy (approximately 97% at 10x compression), and high throughput (over 200,000 pages per day on a single GPU) make it an ideal choice for document digitization, AI training data generation, and multimodal applications.
The open-source nature of this project and its comprehensive documentation make it easy to integrate into various application scenarios, providing researchers and developers with a powerful tool to explore the boundaries of visual text compression.