deepseek-ai/DeepSeek-OCR-2

Advanced OCR model with Visual Causal Flow technology for human-like document understanding and text recognition

Apache-2.0 | Python | DeepSeek-OCR-2 | deepseek-ai | 1.3k | Last Updated: January 27, 2026

DeepSeek-OCR-2: Visual Causal Flow

Overview

DeepSeek-OCR-2 is an optical character recognition (OCR) model built around a concept the authors call Visual Causal Flow. Released by DeepSeek AI on January 27, 2026, it moves away from processing visual tokens in a fixed raster-scan order and instead orders them according to the semantic content of the page.

Key Features

🚀 Visual Causal Flow Technology

Dynamic Token Reordering: Instead of mechanically scanning images left-to-right, top-to-bottom, the model dynamically reorders visual tokens based on semantic content
Human-like Processing: Mimics how humans naturally read and understand documents by following logical information flow
Content-Aware Sequencing: Understands semantic relationships between visual elements rather than just spatial positioning
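The difference between a fixed scan and a content-aware order can be seen on a toy two-column page. The sketch below is only an illustration: the Block type, the coordinates, and the hand-written column rule are assumptions made for this example, and the column rule stands in for an ordering that the model learns rather than computes from geometry.

```python
from typing import NamedTuple


class Block(NamedTuple):
    text: str
    x: int  # left edge of the block
    y: int  # top edge of the block


def raster_order(blocks):
    """Fixed scan: top-to-bottom, then left-to-right within a row."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, column_split):
    """Hand-written stand-in for a semantic order: finish the left column first."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.x >= column_split, b.y))]


# Two columns (A on the left, B on the right), two paragraphs each.
page = [
    Block("A1", 0, 0), Block("B1", 100, 0),
    Block("A2", 0, 20), Block("B2", 100, 20),
]

print(raster_order(page))      # ['A1', 'B1', 'A2', 'B2']
print(column_order(page, 50))  # ['A1', 'A2', 'B1', 'B2']
```

The raster scan interleaves the two columns, while the second order keeps each column together, which is the kind of reading order a person would follow.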

🔧 Technical Architecture

DeepEncoder V2 Architecture

Visual Encoder Upgrade: Replaces the CLIP-based encoder with a lightweight Qwen2-0.5B language model
Causal Attention Mechanism: Implements "causal flow queries" for semantic-driven visual token reorganization
Two-Stage Processing:
1. Visual encoding with semantic understanding
2. LLM decoder performs autoregressive reasoning on ordered sequences
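One way to picture causal attention over a mixed sequence is as an attention mask. The sketch below is an assumption for illustration, not the model's documented implementation: it supposes that visual tokens come first and attend to each other in both directions, and that the causal flow queries follow, each one attending to every visual token and to the queries up to and including itself. The function name and token layout are hypothetical.

```python
def causal_flow_mask(num_visual, num_queries):
    """Return mask[i][j] == 1 where token i may attend to token j.

    Assumed layout: visual tokens first, then query tokens.
    """
    n = num_visual + num_queries
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_visual:
                # Visual tokens see all visual tokens, but no queries.
                mask[i][j] = 1 if j < num_visual else 0
            else:
                # Queries see all visual tokens, earlier queries, and themselves.
                mask[i][j] = 1 if j <= i else 0
    return mask


for row in causal_flow_mask(3, 2):
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

Under this assumed layout, the outputs at the query positions form an ordered sequence that the decoder can then read autoregressively.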

Performance Improvements

3.7% accuracy improvement over previous OCR models
Better reading order understanding for complex documents
Reduced hallucination and text duplication errors
Improved reliability in production use

📊 Capabilities

Document Processing

Convert documents to Markdown format
Free OCR (plain text recognition, prompted without the <|grounding|> tag) for various image types
PDF processing with high concurrency
Figure and chart parsing
Layout-aware text extraction

Supported Formats

Images (JPG, PNG, etc.)
PDF documents
Complex layouts and tables
Multi-column documents
Scientific papers and reports

Installation and Usage

Requirements

Python 3.12.9
CUDA 11.8
PyTorch 2.6.0
Flash Attention 2.7.3

Quick Start

Using Transformers

from transformers import AutoModel, AutoTokenizer
import torch
import os

# Restrict this process to the first GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

# trust_remote_code=True runs the model code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)
# Switch to inference mode, move to the GPU, and cast to bfloat16.
model = model.eval().cuda().to(torch.bfloat16)

# Document to markdown conversion
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# With save_results=True, results are written under output_path.
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)
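To process a folder of images with the Transformers path, the same infer call can be wrapped in a loop. This is a sketch under stated assumptions: collect_images and convert_directory are hypothetical helpers, the suffix list and the one-output-folder-per-image layout are choices made for this example, and model and tokenizer are the objects created in the Quick Start code.

```python
from pathlib import Path

# Assumed set of image suffixes; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_images(directory):
    """Return the image files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def convert_directory(model, tokenizer, input_dir, output_root):
    """Run the document-to-markdown prompt on each image in `input_dir`."""
    prompt = "<image>\n<|grounding|>Convert the document to markdown."
    for image in collect_images(input_dir):
        # One output folder per image, named after the file stem.
        out_dir = Path(output_root) / image.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        model.infer(
            tokenizer,
            prompt=prompt,
            image_file=str(image),
            output_path=str(out_dir),
            base_size=1024,
            image_size=768,
            crop_mode=True,
            save_results=True,
        )
```

Usage would be convert_directory(model, tokenizer, 'scans', 'your/output/dir'). Images are processed one at a time here; for larger workloads the vLLM path is the one the project recommends.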

Using vLLM (for high performance)

The project includes vLLM support for faster inference and batch processing, particularly useful for PDF processing and benchmark evaluations.

Prompt Examples

Document conversion: <image>\n<|grounding|>Convert the document to markdown.
General OCR: <image>\nFree OCR.
Figure parsing: <image>\nParse the figure.
Image description: <image>\nDescribe this image in detail.
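The four prompts above can be kept in one place so that calling code selects a task by name instead of repeating the strings. This is a small convenience sketch; the task keys and the get_prompt helper are hypothetical names, while the prompt strings are the ones listed above.

```python
PROMPTS = {
    "markdown": "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
}


def get_prompt(task):
    """Return the prompt string for a task name, or raise ValueError if unknown."""
    try:
        return PROMPTS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(PROMPTS)}"
        ) from None


print(repr(get_prompt("free_ocr")))  # '<image>\nFree OCR.'
```

Note that only the document conversion prompt carries the <|grounding|> tag.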

Technical Innovation

Problem with Traditional OCR

Traditional OCR systems suffer from three critical limitations:

Lower accuracy on complex documents due to fixed scanning patterns
Incorrect reading order interpretation when related information is scattered
Higher error rates in production including text duplication and hallucination

Visual Causal Flow Solution

DeepSeek-OCR-2 addresses these issues by:

Understanding semantic relationships between visual elements
Following logical information flow rather than spatial positioning
Reasoning about visual precedence similar to human document comprehension

Architecture Benefits

Language Model as Visual Encoder: Using Qwen2-0.5B enables semantic understanding of visual content
Causal Attention: Allows the model to reason about which visual elements logically precede others
Efficiency: Balances semantic understanding capability with computational efficiency

Performance and Benchmarks

Accuracy Improvements

3.7% better performance compared to previous OCR models
Superior reading order understanding for complex layouts
Reduced error rates in production environments
Better handling of tables, figures, and multi-column layouts
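This page does not say which metric or benchmark the 3.7% figure refers to. For readers who want to score OCR output on their own documents, one common measure is the character-level edit distance between the predicted text and a reference transcription, normalized by length. The sketch below implements that measure as an illustration only; it is not claimed to be the metric used for the numbers above.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length; 0.0 is a perfect match."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


print(edit_distance("kitten", "sitting"))         # 3
print(normalized_edit_distance("abcd", "abcd"))   # 0.0
```

Lower values are better. Duplicated or hallucinated text in the prediction shows up as extra insertions and raises the score.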

Use Cases

Academic paper processing
Business document digitization
Legal document analysis
Technical manual conversion
Scientific publication parsing

Project Structure

DeepSeek-OCR-2/
├── DeepSeek-OCR2-master/          # Core implementation
│   ├── DeepSeek-OCR2-vllm/       # vLLM inference scripts
│   └── DeepSeek-OCR2-hf/         # Hugging Face transformers scripts
├── assets/                        # Project assets and figures
├── DeepSeek_OCR2_paper.pdf       # Research paper
├── requirements.txt               # Python dependencies
└── README.md                      # Project documentation

Research and Development

Academic Contribution

Research Paper: "DeepSeek-OCR 2: Visual Causal Flow"
Open Source: Available on GitHub and Hugging Face
License: Apache 2.0

Future Development

2D Image Understanding: Plans to implement true 2D reasoning through cascaded 1D causal reasoners
Broader VLM Applications: Visual Causal Flow concept applicable to other vision-language tasks
Enhanced Spatial Reasoning: Improved understanding of complex visual layouts

Comparison with Previous Models

Feature	Traditional OCR	DeepSeek-OCR	DeepSeek-OCR-2
Scanning Method	Fixed raster-scan	Compressed visual tokens	Semantic causal flow
Reading Order	Spatial only	Improved spatial	Semantic understanding
Visual Encoder	Varies by system	CLIP-based	Qwen2-0.5B LM
Accuracy	Baseline	Improved	+3.7% improvement
Semantic Understanding	Limited	Better	Human-like

Community and Resources

Acknowledgments

The project builds upon and acknowledges contributions from:

DeepSeek-OCR
Vary
GOT-OCR2.0
MinerU
PaddleOCR
OmniDocBench (for benchmarking)

Conclusion

DeepSeek-OCR-2 represents a significant advancement in OCR technology by introducing Visual Causal Flow, which enables more human-like document understanding. This innovation addresses fundamental limitations of traditional OCR systems and opens new possibilities for document processing applications across various industries.

The project's open-source nature, comprehensive documentation, and strong performance improvements make it a valuable tool for researchers, developers, and organizations requiring advanced document processing capabilities.