
A simple screen parsing tool for purely visual GUI agents

CC-BY-4.0 license · Jupyter Notebook · 22.5k stars · microsoft · Last Updated: 2025-03-26

OmniParser Project Details

Project Overview

OmniParser is a comprehensive method for parsing user interface screenshots into structured, easy-to-understand elements. This significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

Project Address: https://github.com/microsoft/OmniParser

Core Features

1. Screen Parsing Capabilities

  • Interactive Icon Detection: Reliably identifies interactive icons in the user interface.
  • Semantic Understanding: Understands the semantics of various elements in the screenshot and accurately associates expected actions with corresponding areas on the screen.
  • Structured Output: Converts UI screenshots into a structured list of elements, improving the performance of LLM-based UI agents (an illustrative example follows this list).
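
For illustration, a parsed screenshot can be represented as a list of element records along the following lines. This is a hypothetical sketch: the field names and value formats are assumptions chosen to convey the idea, not the project's documented schema.

parsed_elements = [
    {"type": "icon", "bbox": [0.62, 0.04, 0.66, 0.09], "interactivity": True, "content": "settings gear"},
    {"type": "text", "bbox": [0.10, 0.12, 0.35, 0.15], "interactivity": False, "content": "Sign in to continue"},
]  # bbox: normalized [x1, y1, x2, y2] coordinates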

2. Technical Architecture

OmniParser includes two main components:

  • Interactive Icon Detection Dataset: Curated from popular websites and automatically annotated, highlighting clickable and actionable areas.
  • Icon Description Dataset: Associates each UI element with its corresponding function.
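
As a rough illustration, a record in each dataset might look like the following. The field names and values here are hypothetical, shown only to convey what each dataset pairs together.

# Interactive icon detection: an image region marked as clickable (assumed schema)
detection_record = {"image": "shop_home.png", "bbox": [0.91, 0.02, 0.95, 0.06], "label": "interactable"}

# Icon description: a UI element paired with its function (assumed schema)
caption_record = {"icon_crop": "cart_icon.png", "description": "open the shopping cart"}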

Key Features

OmniTool

OmniTool: Control a Windows 11 virtual machine using OmniParser + your choice of vision model.

Supported Features:

  • Multi-agent orchestration
  • Local trajectory logging
  • Training data pipeline for your domain
  • Improved user interface experience

Supported Models

  • OpenAI GPT-4o/o1/o3-mini
  • DeepSeek R1
  • Qwen2.5-VL
  • Anthropic Computer Use

Installation and Usage

Environment Configuration

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download Model Weights

# Download the V2 model checkpoints into the local weights/ directory
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
# Rename the caption weights folder to the name the code expects
mv weights/icon_caption weights/icon_caption_florence

Run Demo

python gradio_demo.py

Performance

  • Achieves state-of-the-art performance on the Windows Agent Arena benchmark
  • Achieves a state-of-the-art score of 39.5% on the ScreenSpot Pro GUI grounding benchmark
  • Significantly improves the accuracy of GPT-4V on GUI operation tasks

Application Scenarios

  1. GUI Automated Testing: Automatically identify and manipulate user interface elements.
  2. Intelligent Assistant Development: Build AI assistants that can understand and operate graphical interfaces.
  3. Accessibility Technology: Help visually impaired users understand screen content.
  4. Process Automation: Automate repetitive GUI operation tasks.
  5. User Experience Research: Analyze the usability and interactivity of user interfaces.

Technical Advantages

  1. Purely Visual Approach: Does not rely on underlying UI code or APIs, works solely through visual information.
  2. High-Precision Localization: Accurately identifies the location and function of interactive elements.
  3. Cross-Platform Compatibility: Supports multiple operating systems and applications.
  4. Scalability: Supports integration with various large language models (see the sketch after this list).
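
To illustrate the scalability point, the minimal sketch below feeds a parsed element list to an LLM as plain text and asks for the next action. Everything here is an assumption for illustration: the element list stands in for OmniParser's output, and the prompt format is invented rather than taken from the project.

from openai import OpenAI

# Hypothetical OmniParser-style output for the current screenshot
elements = [
    {"id": 0, "bbox": [0.62, 0.04, 0.66, 0.09], "content": "settings gear icon"},
    {"id": 1, "bbox": [0.10, 0.12, 0.35, 0.15], "content": "Sign in button"},
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = (
    "UI elements:\n"
    + "\n".join(f"{e['id']}: {e['content']} at {e['bbox']}" for e in elements)
    + "\nTask: open the settings page. Reply with the id of the element to click."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. "0"

Because the prompt is plain text, swapping in a different backend (DeepSeek, Qwen, etc.) only changes the client call, not the parsing stage.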

Datasets and Models

Detection Model

  • Based on YOLO architecture
  • Released under the AGPL license (inherited from the underlying YOLO model)
  • Specifically trained for UI element detection
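
A minimal sketch of running the detector, assuming the weights layout produced by the download step above and the ultralytics YOLO API; treating the checkpoint as directly loadable this way is an assumption, and the 0.05 confidence threshold is an illustrative choice, not the project's setting.

from ultralytics import YOLO

# Load the UI-element detector downloaded to weights/icon_detect/
detector = YOLO("weights/icon_detect/model.pt")

# A low confidence threshold helps keep small icons in the candidate set
results = detector.predict("screenshot.png", conf=0.05)
for box in results[0].boxes.xyxy.tolist():
    print(box)  # [x1, y1, x2, y2] pixel coordinates of a candidate element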

Description Model

  • Based on the BLIP-2 and Florence-2 architectures
  • Specifically designed for generating functional descriptions of UI elements
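
A minimal sketch of captioning one cropped element with a Florence-2-style model through Hugging Face transformers. The "<CAPTION>" prompt and the processor/generate calls follow standard Florence-2 usage; loading the downloaded icon_caption_florence weights this way, and pairing them with the microsoft/Florence-2-base processor, are assumptions.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("weights/icon_caption_florence", trust_remote_code=True)

crop = Image.open("icon_crop.png")  # a single UI element cropped from a screenshot
inputs = processor(text="<CAPTION>", images=crop, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=32,
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])  # functional description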

Related Links

  • GitHub Repository: https://github.com/microsoft/OmniParser
  • Model Weights: https://huggingface.co/microsoft/OmniParser-v2.0