
A simple screen parsing tool for purely visual GUI agents

CC-BY-4.0 license · Jupyter Notebook · 22.5k stars · microsoft · Last Updated: 2025-03-26

OmniParser Project Details

Project Overview

OmniParser is a comprehensive method for parsing user interface screenshots into structured, easy-to-understand elements. This significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.

Project Address: https://github.com/microsoft/OmniParser

Core Features

1. Screen Parsing Capabilities

  • Interactive Icon Detection: Reliably identifies interactive icons in the user interface.
  • Semantic Understanding: Understands the semantics of various elements in the screenshot and accurately associates expected actions with corresponding areas on the screen.
  • Structured Output: Converts UI screenshots into a structured list of elements, improving the performance of LLM-based UI agents (an illustrative example follows this list).
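
For illustration, a parsed screenshot can be represented as a list of element records along the following lines. This is a hypothetical sketch: the field names and value formats are assumptions chosen to convey the idea, not the project's documented schema.

parsed_elements = [
    {"type": "icon", "bbox": [0.62, 0.04, 0.66, 0.09], "interactivity": True, "content": "settings gear"},
    {"type": "text", "bbox": [0.10, 0.12, 0.35, 0.15], "interactivity": False, "content": "Sign in to continue"},
]  # bbox: normalized [x1, y1, x2, y2] coordinates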

2. Technical Architecture

OmniParser includes two main components:

  • Interactive Icon Detection Dataset: Curated from popular websites and automatically annotated, highlighting clickable and actionable areas.
  • Icon Description Dataset: Associates each UI element with its corresponding function.
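
As a rough illustration, a record in each dataset might look like the following. The field names and values here are hypothetical, shown only to convey what each dataset pairs together.

# Interactive icon detection: an image region marked as clickable (assumed schema)
detection_record = {"image": "shop_home.png", "bbox": [0.91, 0.02, 0.95, 0.06], "label": "interactable"}

# Icon description: a UI element paired with its function (assumed schema)
caption_record = {"icon_crop": "cart_icon.png", "description": "open the shopping cart"}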

Key Features

OmniTool

OmniTool: Control a Windows 11 virtual machine using OmniParser + your choice of vision model.

Supported Features:

  • Multi-agent orchestration
  • Local trajectory logging
  • Training data pipeline for your domain
  • Improved user interface experience

Supported Models

  • OpenAI GPT-4o/o1/o3-mini
  • DeepSeek R1
  • Qwen2.5-VL
  • Anthropic Computer Use

Installation and Usage

Environment Configuration

git clone https://github.com/microsoft/OmniParser.git
cd OmniParser
conda create -n "omni" python==3.12
conda activate omni
pip install -r requirements.txt

Download Model Weights

# Download the V2 model checkpoints into the local weights/ directory
for f in icon_detect/{train_args.yaml,model.pt,model.yaml} icon_caption/{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft/OmniParser-v2.0 "$f" --local-dir weights; done
# Rename the caption weights folder to the name the code expects
mv weights/icon_caption weights/icon_caption_florence

Run Demo

python gradio_demo.py

Performance

  • Achieves state-of-the-art performance on the Windows Agent Arena benchmark
  • Achieves a state-of-the-art score of 39.5% on the ScreenSpot Pro GUI grounding benchmark
  • Significantly improves the accuracy of GPT-4V on GUI operation tasks

Application Scenarios

  1. GUI Automated Testing: Automatically identify and manipulate user interface elements.
  2. Intelligent Assistant Development: Build AI assistants that can understand and operate graphical interfaces.
  3. Accessibility Technology: Help visually impaired users understand screen content.
  4. Process Automation: Automate repetitive GUI operation tasks.
  5. User Experience Research: Analyze the usability and interactivity of user interfaces.

Technical Advantages

  1. Purely Visual Approach: Does not rely on underlying UI code or APIs, works solely through visual information.
  2. High-Precision Localization: Accurately identifies the location and function of interactive elements.
  3. Cross-Platform Compatibility: Supports multiple operating systems and applications.
  4. Scalability: Supports integration with various large language models (see the sketch after this list).
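
To illustrate the scalability point, the minimal sketch below feeds a parsed element list to an LLM as plain text and asks for the next action. Everything here is an assumption for illustration: the element list stands in for OmniParser's output, and the prompt format is invented rather than taken from the project.

from openai import OpenAI

# Hypothetical OmniParser-style output for the current screenshot
elements = [
    {"id": 0, "bbox": [0.62, 0.04, 0.66, 0.09], "content": "settings gear icon"},
    {"id": 1, "bbox": [0.10, 0.12, 0.35, 0.15], "content": "Sign in button"},
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = (
    "UI elements:\n"
    + "\n".join(f"{e['id']}: {e['content']} at {e['bbox']}" for e in elements)
    + "\nTask: open the settings page. Reply with the id of the element to click."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. "0"

Because the prompt is plain text, swapping in a different backend (DeepSeek, Qwen, etc.) only changes the client call, not the parsing stage.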

Datasets and Models

Detection Model

  • Based on YOLO architecture
  • Released under the AGPL license (inherited from the underlying YOLO model)
  • Specifically trained for UI element detection
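
A minimal sketch of running the detector, assuming the weights layout produced by the download step above and the ultralytics YOLO API; treating the checkpoint as directly loadable this way is an assumption, and the 0.05 confidence threshold is an illustrative choice, not the project's setting.

from ultralytics import YOLO

# Load the UI-element detector downloaded to weights/icon_detect/
detector = YOLO("weights/icon_detect/model.pt")

# A low confidence threshold helps keep small icons in the candidate set
results = detector.predict("screenshot.png", conf=0.05)
for box in results[0].boxes.xyxy.tolist():
    print(box)  # [x1, y1, x2, y2] pixel coordinates of a candidate element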

Description Model

  • Based on the BLIP-2 and Florence-2 architectures
  • Specifically designed for generating functional descriptions of UI elements
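
A minimal sketch of captioning one cropped element with a Florence-2-style model through Hugging Face transformers. The "<CAPTION>" prompt and the processor/generate calls follow standard Florence-2 usage; loading the downloaded icon_caption_florence weights this way, and pairing them with the microsoft/Florence-2-base processor, are assumptions.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("weights/icon_caption_florence", trust_remote_code=True)

crop = Image.open("icon_crop.png")  # a single UI element cropped from a screenshot
inputs = processor(text="<CAPTION>", images=crop, return_tensors="pt")
ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=32,
)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])  # functional description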

Related Links

  • GitHub Repository: https://github.com/microsoft/OmniParser
  • Model Weights: https://huggingface.co/microsoft/OmniParser-v2.0