Advanced OCR model with Visual Causal Flow technology for human-like document understanding and text recognition

License: Apache-2.0 · Language: Python · Repo: deepseek-ai/DeepSeek-OCR-2 · Stars: 1.3k · Last Updated: January 27, 2026

DeepSeek-OCR-2: Visual Causal Flow

Overview

DeepSeek-OCR-2 is an optical character recognition (OCR) model that introduces the concept of Visual Causal Flow. Released by DeepSeek AI on January 27, 2026, the project shifts OCR from traditional fixed raster-scan processing to semantic-driven visual understanding.

Key Features

🚀 Visual Causal Flow Technology

  • Dynamic Token Reordering: Instead of mechanically scanning images left-to-right, top-to-bottom, the model dynamically reorders visual tokens based on semantic content
  • Human-like Processing: Mimics how humans naturally read and understand documents by following logical information flow
  • Content-Aware Sequencing: Understands semantic relationships between visual elements rather than just spatial positioning

🔧 Technical Architecture

DeepEncoder V2 Architecture

  • Visual Encoder Upgrade: Replaces the CLIP-based encoder with a lightweight Qwen2-0.5B language model
  • Causal Attention Mechanism: Implements "causal flow queries" for semantic-driven visual token reorganization (see the sketch after this list)
  • Two-Stage Processing:
    1. Visual encoding with semantic understanding
    2. LLM decoder performs autoregressive reasoning on ordered sequences
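
The repository holds the actual implementation; conceptually, the reordering stage can be pictured as a set of learned "flow queries" that cross-attend over the unordered patch tokens and emit a semantically ordered sequence for the decoder. A minimal PyTorch sketch, where the module name, query count, and hidden size (896 matches Qwen2-0.5B) are illustrative assumptions rather than the project's actual code:

import torch
import torch.nn as nn

class CausalFlowReorder(nn.Module):
    """Illustrative sketch: learned flow queries cross-attend over
    unordered patch tokens and emit a semantically ordered sequence."""
    def __init__(self, dim=896, n_queries=256, n_heads=8):
        super().__init__()
        # One learned query per position in the ordered output sequence
        self.flow_queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens):  # (batch, n_patches, dim), unordered
        q = self.flow_queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        # Each query gathers the content that belongs at its position
        ordered, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        return ordered  # (batch, n_queries, dim), fed to the LLM decoder (stage 2)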

Performance Improvements

  • 3.7% accuracy improvement over the previous-generation DeepSeek-OCR
  • Better reading-order understanding for complex documents
  • Reduced hallucination and text-duplication errors
  • Improved reliability in production deployments

📊 Capabilities

Document Processing

  • Convert documents to Markdown format
  • Free OCR for various image types
  • PDF processing with high concurrency
  • Figure and chart parsing
  • Layout-aware text extraction

Supported Formats

  • Images (JPG, PNG, etc.)
  • PDF documents
  • Complex layouts and tables
  • Multi-column documents
  • Scientific papers and reports

Installation and Usage

Requirements

  • Python 3.12.9
  • CUDA 11.8
  • PyTorch 2.6.0
  • Flash Attention 2.7.3
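
A quick way to confirm the environment matches these pins before loading the model:

import torch

print(torch.__version__)          # expect 2.6.0
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # must be True; flash attention needs a GPU
try:
    import flash_attn
    print(flash_attn.__version__)  # expect 2.7.3
except ImportError:
    print("flash-attn is not installed")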

Quick Start

Using Transformers

from transformers import AutoModel, AutoTokenizer
import torch
import os

# Pin inference to a single GPU
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR-2'

# trust_remote_code is required: the model ships custom code, including infer()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',  # needs flash-attn (see Requirements)
    trust_remote_code=True,
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Document to markdown conversion
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,    # resolution of the global page view
    image_size=768,    # resolution of local crops
    crop_mode=True,    # tile large or dense pages into crops
    save_results=True  # write results to output_path
)

Using vLLM (for high performance)

The project includes vLLM support for faster inference and batch processing, particularly useful for PDF processing and benchmark evaluations.
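
The scripts under DeepSeek-OCR2-vllm/ are the authoritative reference. As a rough sketch of what batched inference looks like through vLLM's multimodal interface, assuming the model is registered with vLLM and using the prompt format shown above (file names are placeholders):

from vllm import LLM, SamplingParams
from PIL import Image

# Assumes DeepSeek-OCR-2 is usable through vLLM's multimodal runner;
# see the DeepSeek-OCR2-vllm/ scripts for the actual setup.
llm = LLM(model='deepseek-ai/DeepSeek-OCR-2', trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=4096)

prompt = "<image>\n<|grounding|>Convert the document to markdown."
requests = [
    {"prompt": prompt, "multi_modal_data": {"image": Image.open(path)}}
    for path in ["page_1.png", "page_2.png"]  # e.g. pages rendered from a PDF
]

outputs = llm.generate(requests, sampling)
for out in outputs:
    print(out.outputs[0].text)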

Prompt Examples

  • Document conversion: <image>\n<|grounding|>Convert the document to markdown.
  • General OCR: <image>\nFree OCR.
  • Figure parsing: <image>\nParse the figure.
  • Image description: <image>\nDescribe this image in detail.
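
Any of these prompts drops into the same infer() call from the Quick Start; for example, figure parsing (the file name is a placeholder):

# Reuses the model and tokenizer loaded in the Quick Start
result = model.infer(
    tokenizer,
    prompt="<image>\nParse the figure.",
    image_file='your_chart.png',
    output_path='your/output/dir',
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)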

Technical Innovation

Problem with Traditional OCR

Traditional OCR systems suffer from three critical limitations:

  1. Lower accuracy on complex documents due to fixed scanning patterns
  2. Incorrect reading order interpretation when related information is scattered (illustrated after this list)
  3. Higher error rates in production including text duplication and hallucination
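
The reading-order failure is easy to reproduce: naively sorting detected text boxes in raster order interleaves the columns of a two-column page. A small, self-contained illustration (coordinates and text are made up):

# Two-column page: each box is (x, y, text); coordinates are illustrative.
boxes = [
    (0, 0,   "Col 1, para 1"), (400, 0,   "Col 2, para 1"),
    (0, 100, "Col 1, para 2"), (400, 100, "Col 2, para 2"),
]

raster = sorted(boxes, key=lambda b: (b[1], b[0]))  # top-to-bottom, left-to-right
print([t for _, _, t in raster])
# ['Col 1, para 1', 'Col 2, para 1', 'Col 1, para 2', 'Col 2, para 2']  <- interleaved

by_column = sorted(boxes, key=lambda b: (b[0], b[1]))  # column-aware reading order
print([t for _, _, t in by_column])
# ['Col 1, para 1', 'Col 1, para 2', 'Col 2, para 1', 'Col 2, para 2']  <- correct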

Visual Causal Flow Solution

DeepSeek-OCR-2 addresses these issues by:

  • Understanding semantic relationships between visual elements
  • Following logical information flow rather than spatial positioning
  • Reasoning about visual precedence similar to human document comprehension

Architecture Benefits

  • Language Model as Visual Encoder: Using Qwen2-0.5B enables semantic understanding of visual content
  • Causal Attention: Allows the model to reason about which visual elements logically precede others (see the mask sketch after this list)
  • Efficiency: Balances semantic understanding capability with computational efficiency
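
The causal attention itself is the standard lower-triangular mask; what changes is what the sequence positions mean. A short illustration (not the project's code):

import torch
import torch.nn.functional as F

# Causal mask over 5 tokens: position i attends only to positions 0..i.
# In DeepSeek-OCR-2 the positions are the semantically reordered visual
# tokens, so "earlier" means "logically precedes", not "higher or further
# left on the page".
n, dim = 5, 16
q, k = torch.randn(n, dim), torch.randn(n, dim)
scores = q @ k.T / dim**0.5
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))
weights = F.softmax(scores, dim=-1)  # each row sums to 1 over allowed positions
print(weights)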

Performance and Benchmarks

Accuracy Improvements

  • 3.7% better accuracy compared to the previous-generation DeepSeek-OCR
  • Superior reading order understanding for complex layouts
  • Reduced error rates in production environments
  • Better handling of tables, figures, and multi-column layouts

Use Cases

  • Academic paper processing
  • Business document digitization
  • Legal document analysis
  • Technical manual conversion
  • Scientific publication parsing

Project Structure

DeepSeek-OCR-2/
├── DeepSeek-OCR2-master/          # Core implementation
│   ├── DeepSeek-OCR2-vllm/       # vLLM inference scripts
│   └── DeepSeek-OCR2-hf/         # Hugging Face transformers scripts
├── assets/                        # Project assets and figures
├── DeepSeek_OCR2_paper.pdf       # Research paper
├── requirements.txt               # Python dependencies
└── README.md                      # Project documentation

Research and Development

Academic Contribution

  • Research Paper: "DeepSeek-OCR 2: Visual Causal Flow"
  • Open Source: Available on GitHub and Hugging Face
  • License: Apache 2.0

Future Development

  • 2D Image Understanding: Plans to implement true 2D reasoning through cascaded 1D causal reasoners
  • Broader VLM Applications: Visual Causal Flow concept applicable to other vision-language tasks
  • Enhanced Spatial Reasoning: Improved understanding of complex visual layouts

Comparison with Previous Models

| Feature | Traditional OCR | DeepSeek-OCR | DeepSeek-OCR-2 |
| --- | --- | --- | --- |
| Scanning Method | Fixed raster-scan | Compressed visual tokens | Semantic causal flow |
| Reading Order | Spatial only | Improved spatial | Semantic understanding |
| Visual Encoder | CLIP-based | CLIP-based | Qwen2-0.5B LM |
| Accuracy | Baseline | Improved | +3.7% improvement |
| Semantic Understanding | Limited | Better | Human-like |

Community and Resources

Acknowledgments

The project builds upon and acknowledges contributions from:

  • DeepSeek-OCR
  • Vary
  • GOT-OCR2.0
  • MinerU
  • PaddleOCR
  • OmniDocBench (for benchmarking)

Conclusion

DeepSeek-OCR-2 represents a significant advancement in OCR technology by introducing Visual Causal Flow, which enables more human-like document understanding. This innovation addresses fundamental limitations of traditional OCR systems and opens new possibilities for document processing applications across various industries.

The project's open-source nature, comprehensive documentation, and strong performance improvements make it a valuable tool for researchers, developers, and organizations requiring advanced document processing capabilities.
