An efficient OCR model based on visual compression, capable of converting document images into Markdown format, supporting multi-resolution and multi-language recognition.


DeepSeek-OCR Project Details

Project Overview

DeepSeek-OCR is an innovative open-source optical character recognition model developed by the DeepSeek AI team, focusing on exploring the boundaries of visual text compression. This project investigates the role of visual encoders from a Large Language Model (LLM)-centric perspective, treating visual perception as an information compression medium. This approach enables the processing of large and complex documents with significantly fewer tokens.

Key Features

  • Efficient Compression: Achieves a 7-20x token compression ratio while maintaining approximately 97% decoding accuracy at 10x compression.
  • Multi-resolution Support: Supports various native resolutions from 512×512 to 1280×1280.
  • High-performance Processing: A single A100-40G GPU can generate over 200,000 pages of training data per day.
  • Multi-language Support: Supports text recognition for approximately 100 languages.
  • Versatility: Not only supports text extraction but also understands charts, chemical formulas, and simple graphics.

Technical Architecture

Model Components

DeepSeek-OCR consists of two core components:

  1. DeepEncoder (Visual Encoder)

    • Parameter Count: Approximately 380 million
    • Architectural Composition:
      • SAM-ViTDet (Meta's 80-million-parameter segmentation model) for local image perception.
      • 2-layer convolutional compressor, achieving 16x token downsampling.
      • CLIP ViT-300M (OpenAI's 300-million-parameter model) for global visual knowledge aggregation.
  2. DeepSeek3B-MoE Decoder

    • Active Parameters: Approximately 570 million
    • Total Parameters: 3B, using a Mixture-of-Experts (MoE) architecture
    • Function: Generates results based on image tokens and prompt information.

How It Works

  1. Image Processing Workflow:

    • A 1024×1024-pixel image is first divided into 4096 patch tokens.
    • The SAM module performs window attention processing.
    • The compressor reduces tokens to 256 (16x compression).
    • The CLIP module performs global attention processing.
    • Finally, the compressed visual tokens are output to the decoder (see the token-count sketch after this list).
  2. Resolution Modes:

    • Native Resolution Mode:

      • Tiny: 512×512 (64 visual tokens)
      • Small: 640×640 (100 visual tokens)
      • Base: 1024×1024 (256 visual tokens)
      • Large: 1280×1280 (400 visual tokens)
    • Dynamic Resolution Mode:

      • Gundam: n×640×640 + 1×1024×1024 (combining global and local views)
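
As a rough illustration of the numbers above, the visual token counts for the native modes follow from dividing the image into patches and applying the 16x compressor. This is a minimal sketch: the 16-pixel patch size and the helper function are assumptions chosen so the arithmetic matches the figures listed above, not part of the project's API.

# Illustrative sketch only: patch size and helper name are assumptions.
def visual_tokens(width, height, patch=16, compression=16):
    patches = (width // patch) * (height // patch)  # e.g. 1024x1024 -> 64*64 = 4096
    return patches // compression                   # 16x convolutional downsampling

for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(name, visual_tokens(side, side))          # 64, 100, 256, 400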

Performance Benchmarks

Benchmark Results

  • Fox Benchmark: Achieves approximately 97% decoding accuracy at a 10x compression rate.
  • OmniDocBench Benchmark:
    • Outperforms GOT-OCR2.0 (256 tokens/page) using only 100 visual tokens.
    • Outperforms MinerU2.0 (averaging over 6000 tokens/page) using fewer than 800 visual tokens.

Training and Inference Performance

  • Training Speed:
    • Pure text data: 90B tokens per day
    • Multimodal data: 70B tokens per day
  • Production Performance: A single A100-40G node can process over 200,000 pages per day.
  • Concurrent Performance: PDF processing at approximately 2500 tokens/s (A100-40G).

Application Scenarios

Supported Prompt Modes

DeepSeek-OCR supports multiple prompt modes:

# Document to Markdown
prompt = "<image>\n<|grounding|>Convert the document to markdown."

# General OCR
prompt = "<image>\n<|grounding|>OCR this image."

# Free OCR (no layout)
prompt = "<image>\nFree OCR."

# Figure Parsing
prompt = "<image>\nParse the figure."

# Detailed Image Description
prompt = "<image>\nDescribe this image in detail."

# Text Localization
prompt = "<image>\nLocate <|ref|>xxxx<|/ref|> in the image."

Practical Applications

  1. Document Digitization: Efficiently processes academic papers, books, reports, and other documents.
  2. Dataset Generation: Generates massive training data for Large Language Models and Vision-Language Models.
  3. Chatbot Context Compression: Stores old conversation records at lower resolution (similar to human memory decay).
  4. Structured Data Extraction:
    • Converts financial charts into structured data.
    • Automatically generates Markdown tables and graphics.
    • Supports chemical formula (SMILES format) recognition.

Installation and Usage

Environment Requirements

  • Python 3.12.9
  • CUDA 11.8
  • PyTorch 2.6.0
  • Transformers 4.46.3

Installation Steps

# Clone the repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR

# Create Conda environment
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr

# Install dependencies
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
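
A quick sanity check (an optional sketch, not part of the official installation steps) to confirm that the GPU build of PyTorch and flash-attn are importable before running the model:

import torch
print(torch.__version__, torch.version.cuda)   # expect 2.6.0 and 11.8
print(torch.cuda.is_available())               # should be True on a CUDA machine

import flash_attn                              # raises ImportError if the wheel did not install
print(flash_attn.__version__)                  # expect 2.7.3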

Usage Examples

Method 1: Using Transformers

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'

model_name = 'deepseek-ai/DeepSeek-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, 
    _attn_implementation='flash_attention_2', 
    trust_remote_code=True, 
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Configure inference parameters
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# Execute inference
res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=1024, 
    image_size=640, 
    crop_mode=True, 
    save_results=True, 
    test_compress=True
)
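
The call above uses base_size=1024, image_size=640, and crop_mode=True, which corresponds to the dynamic Gundam mode. The sketch below shows how the other resolution modes would presumably be selected; the exact parameter combinations are an assumption and should be verified against the repository's examples.

# Presumed parameter combinations for the resolution modes described earlier;
# verify against the repository before relying on them.
MODES = {
    "Tiny":   dict(base_size=512,  image_size=512,  crop_mode=False),  # 64 tokens
    "Small":  dict(base_size=640,  image_size=640,  crop_mode=False),  # 100 tokens
    "Base":   dict(base_size=1024, image_size=1024, crop_mode=False),  # 256 tokens
    "Large":  dict(base_size=1280, image_size=1280, crop_mode=False),  # 400 tokens
    "Gundam": dict(base_size=1024, image_size=640,  crop_mode=True),   # dynamic tiling
}

res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
                  output_path=output_path, save_results=True, **MODES["Base"])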

Method 2: Using vLLM (High-Performance Inference)

# Modify configuration file
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py to set INPUT_PATH/OUTPUT_PATH

# Run image OCR (streaming output)
python run_dpsk_ocr_image.py

# Run PDF OCR (high concurrency)
python run_dpsk_ocr_pdf.py

# Batch evaluation
python run_dpsk_ocr_eval_batch.py
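
A minimal illustration of the config.py edit mentioned above. Only INPUT_PATH and OUTPUT_PATH are named in this document; the paths below are placeholders and any other settings in that file are repository-specific.

# config.py (illustrative values only)
INPUT_PATH = '/path/to/images_or_pdfs'   # directory or file to process
OUTPUT_PATH = '/path/to/output_dir'      # where results are written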

Technical Innovations

Visual Text Compression Paradigm

DeepSeek-OCR proposes a new visual text compression paradigm:

  • Core Idea: Converts text into images and processes them via a visual encoder, rather than storing semantics as text tokens (a worked example follows this list).
  • Advantages:
    • Lower memory footprint: Visual tokens are more compact.
    • Faster inference speed: Fewer tokens = less computation.
    • Natural forgetting mechanism: Old context can be downsampled.
    • Easier multimodal fusion: The model already treats text as images.
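
As a worked example of the trade-off: the 256-token Base mode and the ~97%-accuracy-at-10x result come from the sections above, while the page-level text token count is a hypothetical figure chosen for illustration.

# Hypothetical page containing ~2,560 text tokens, rendered and encoded in Base mode
text_tokens = 2560                     # assumed size of the page as plain text tokens
vision_tokens = 256                    # Base mode (1024x1024) from the table above
ratio = text_tokens / vision_tokens
print(f"compression: {ratio:.0f}x")    # 10x, the regime where ~97% accuracy is reported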

Distinction from Traditional OCR

Traditional OCR employs a pipeline architecture (detection → recognition → post-processing), whereas DeepSeek-OCR uses an end-to-end vision-language model architecture, fundamentally simplifying the OCR system.


Acknowledgements

The DeepSeek-OCR project thanks the following open-source projects for their contributions:

  • Vary
  • GOT-OCR2.0
  • MinerU
  • PaddleOCR
  • OneChart
  • Slow Perception

And benchmark datasets: Fox and OmniDocBench.

Conclusion

DeepSeek-OCR represents a significant innovation in OCR technology, addressing the challenges of long-context processing in Large Language Models through a visual compression paradigm. Its efficient token compression capability (7-20x), excellent accuracy (97% accuracy at 10x compression), and powerful processing capacity (200,000 pages per day on a single GPU) make it an ideal choice for document digitization, AI training data generation, and multimodal applications.

The open-source nature of this project and its comprehensive documentation make it easy to integrate into various application scenarios, providing researchers and developers with a powerful tool to explore the boundaries of visual text compression.
