A powerful family of multimodal GUI automation agents supporting end-to-end operations on mobile devices and PC platforms.
Detailed Introduction to the Mobile-Agent Project
Project Overview
Mobile-Agent is a family of GUI agents developed by the Alibaba X-PLUG team: an end-to-end multimodal agent system designed for mobile devices and PC platforms. The project aims to achieve GUI automation by autonomously operating various applications through visual perception, reasoning and planning, and action execution.
Project Architecture and Components
Core Component Series
1. GUI-Owl Foundation Model
GUI-Owl is a foundational GUI agent model that has achieved state-of-the-art performance among open-source end-to-end models across ten GUI benchmarks, covering localization, Q&A, planning, decision-making, and procedural knowledge in both desktop and mobile environments. GUI-Owl-7B scored 66.4 on AndroidWorld and 29.4 on OSWorld.
2. Mobile-Agent-v3
Mobile-Agent-v3 is a cross-platform multi-agent framework built on GUI-Owl, providing planning, progress management, reflection, and memory capabilities. The underlying GUI-Owl model is a native end-to-end multimodal agent designed as a foundation model for GUI automation, unifying perception, localization, reasoning, planning, and action execution within a single policy network.
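To make the role division concrete, here is a minimal Python sketch of such a plan-act-reflect loop with shared memory. The class names (Planner, Executor, Reflector, Memory) and the string-based stubs are illustrative assumptions, not the project's actual agents, which drive a multimodal model over real screenshots and device actions.
from dataclasses import dataclass, field

@dataclass
class Memory:
    notes: list = field(default_factory=list)  # progress notes shared across steps

class Planner:
    def next_subgoal(self, instruction, memory):
        # Decide the next sub-goal from the instruction and accumulated notes.
        return f"work toward: {instruction}"

class Executor:
    def act(self, subgoal, screenshot):
        # Ground the sub-goal in the current screenshot and emit a single GUI action.
        return {"type": "click", "target": subgoal}

class Reflector:
    def judge(self, subgoal, action, screenshot_after):
        # Check whether the action made progress; "done" ends the episode.
        return "progress"

def run_task(instruction, get_screenshot, execute_action, max_steps=10):
    memory, planner, executor, reflector = Memory(), Planner(), Executor(), Reflector()
    for _ in range(max_steps):
        before = get_screenshot()
        subgoal = planner.next_subgoal(instruction, memory)
        action = executor.act(subgoal, before)
        execute_action(action)
        verdict = reflector.judge(subgoal, action, get_screenshot())
        memory.notes.append((subgoal, action, verdict))
        if verdict == "done":
            break
    return memory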
3. Mobile-Agent-E
Mobile-Agent-E is a self-evolving hierarchical multi-agent framework capable of self-evolution through past experiences, demonstrating stronger performance on complex multi-application tasks.
4. PC-Agent
PC-Agent is a multi-agent collaboration system that can automate productivity scenarios (e.g., Chrome, Word, WeChat) based on user instructions. Its active perception module, designed for dense and diverse interactive elements, is better adapted to the PC platform. The hierarchical multi-agent collaboration structure improves the success rate for more complex task sequences. It now supports both Windows and Mac.
5. Mobile-Agent-v2
Mobile-Agent-v2 is a mobile device operation assistant that achieves effective navigation via multi-agent collaboration. Its multi-agent architecture addresses navigation challenges in long-context input scenarios. An enhanced visual perception module significantly improves operation accuracy.
Technical Features
Core Technical Advantages
- Cross-Platform Compatibility: Supports multiple platforms including Android, iOS, Windows, and Mac.
- Visual Perception Capability: Utilizes visual perception tools to accurately identify and locate visual and text elements in application front-end interfaces.
- Multimodal Understanding: Combines visual and language understanding for complex task reasoning.
- End-to-End Operation: A complete automation process from task understanding to execution.
- Self-Evolution: Continuously improves performance through experiential learning.
Technical Innovations
Three Innovations of GUI-Owl
- Large-Scale Environment Infrastructure: Cloud-based virtual environments covering Android, Ubuntu, macOS, and Windows, supporting a self-evolving GUI trajectory production framework.
- Diverse Foundation Agent Capabilities: Integrates UI localization, planning, action semantics, and reasoning patterns, supporting end-to-end decision-making.
- Scalable Environment Reinforcement Learning: Developed a scalable reinforcement learning framework with fully asynchronous training for real-world alignment.
Performance
Benchmark Achievements
- Mobile-Agent-v3 achieved 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art standard for open-source GUI agent frameworks.
- Achieved SOTA performance on multiple GUI automation evaluation leaderboards, including ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, MMBench-GUI, AndroidControl, AndroidWorld, and OSWorld.
System Performance Optimization
- Low memory overhead (8GB)
- Fast inference speed (10-15 seconds per operation)
- All models are open-source
Technical Implementation
Environment Requirements
# Basic environment setup
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt
Android Platform Configuration
- Download Android Debug Bridge (ADB).
- Enable ADB debugging on your Android phone.
- Connect your phone to the computer with a data cable and select "File transfer".
- Test ADB environment:
/path/to/adb devices
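As a quick programmatic check, the snippet below runs the same command and lists attached devices; it is a minimal sketch, and the adb path placeholder and use of Python's subprocess module are illustrative rather than part of the project's code. adb prints a header line followed by one "<serial>\tdevice" line per connected device.
import subprocess

ADB = "/path/to/adb"  # replace with the actual path to your adb binary

def connected_devices():
    # Parse "adb devices": skip the header, keep serials whose state is "device".
    out = subprocess.run([ADB, "devices"], capture_output=True, text=True, check=True).stdout
    return [line.split("\t")[0] for line in out.splitlines()[1:] if line.endswith("\tdevice")]

print(connected_devices())  # expect at least one serial if ADB debugging is enabled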
PC Platform Configuration
# Windows environment
pip install -r requirements.txt
# Mac environment
pip install -r requirements_mac.txt
API Configuration
{
  "vl_model_name": "gpt-4o",
  "llm_model_name": "gpt-4o",
  "token": "sk-...",
  "url": "https://api.openai.com/v1"
}
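The config points the agent at an OpenAI-compatible endpoint. Below is a rough sketch of how such a config is typically consumed; the file name config.json and the use of the openai Python client are assumptions for illustration, not the project's exact loading code.
import json
from openai import OpenAI  # pip install openai

with open("config.json") as f:  # hypothetical file name for the JSON shown above
    cfg = json.load(f)

# Point the client at the configured OpenAI-compatible endpoint.
client = OpenAI(api_key=cfg["token"], base_url=cfg["url"])

resp = client.chat.completions.create(
    model=cfg["vl_model_name"],
    messages=[{"role": "user", "content": "Describe the current screen."}],
)
print(resp.choices[0].message.content)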
Application Scenarios
Supported Operation Types
- Mobile Application Operations: Click, swipe, text input, app switching (see the ADB-based sketch after this list).
- PC Application Operations: Browser control, office software operations, communication software usage.
- Cross-Application Tasks: Complex workflows across multiple applications.
- Complex Reasoning Tasks: Long-term tasks requiring multi-step reasoning.
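On Android, the basic mobile operations above are commonly realized through ADB input commands. The following is a minimal sketch under that assumption, not the project's actual action layer; coordinates are in screen pixels, and the adb path placeholder must be replaced.
import subprocess

ADB = "/path/to/adb"  # replace with the actual path to your adb binary

def adb_shell(*args):
    subprocess.run([ADB, "shell", *args], check=True)

def tap(x, y):
    adb_shell("input", "tap", str(x), str(y))

def swipe(x1, y1, x2, y2, duration_ms=300):
    adb_shell("input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

def type_text(text):
    # "input text" does not accept literal spaces; ADB expects %s in their place.
    adb_shell("input", "text", text.replace(" ", "%s"))

tap(540, 1200)               # e.g. open an app icon
swipe(540, 1600, 540, 400)   # e.g. scroll down a list
type_text("running shoes")   # e.g. fill a search box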
Practical Application Examples
- Online Shopping: Searching for products, comparing prices, adding to cart.
- Information Retrieval: Searching for news, getting sports match results.
- Office Automation: Writing documents, sending emails, data processing.
- Social Media: Posting content, replying to messages, sharing information.
Academic Achievements
Published Papers
- Mobile-Agent-v3 (2025): Foundational Agents for GUI Automation
- PC-Agent (ICLR 2025 Workshop): A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
- Mobile-Agent-E (2025): Self-Evolving Mobile Assistant for Complex Tasks
- Mobile-Agent-v2 (NeurIPS 2024): Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- Mobile-Agent (ICLR 2024 Workshop): Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Awards
- Best Demo Award at the 24th China National Conference on Computational Linguistics (CCL 2025)
- Best Demo Award at the 23rd China National Conference on Computational Linguistics (CCL 2024)
Evaluation Benchmarks
Mobile-Eval Benchmark
Mobile-Eval is a benchmark designed to evaluate the performance of mobile device agents, including 10 mainstream single-application scenarios and 1 multi-application scenario. Each scenario features three types of instructions.
Test Scenario Examples
- Shopping Task: Find a hat on Alibaba and add it to the shopping cart.
- Music Playback: Search for singer Jay Chou in Amazon Music.
- Information Query: Search for today's Lakers game results.
- Email Sending: Send an empty email to a specified address.
Technology Stack
Core Technologies
- Multimodal Large Language Models: GPT-4V, Qwen-VL, etc.
- Visual Perception: CLIP, GroundingDINO, etc.
- Reinforcement Learning: Trajectory-aware Relative Policy Optimization (TRPO)
- Multi-Agent Framework: Hierarchical collaboration architecture
Supported Platforms
- Mobile Platforms: Android, HarmonyOS (≤ version 4)
- Desktop Platforms: Windows, macOS, Ubuntu
- Browsers: Chrome and other mainstream browsers
- Office Software: Word, Excel, PowerPoint, etc.
Open-Source Information
Repository Structure
MobileAgent/
├── Mobile-Agent/ # Original version
├── Mobile-Agent-v2/ # Multi-agent collaboration version
├── Mobile-Agent-v3/ # Latest version based on GUI-Owl
├── Mobile-Agent-E/ # Self-evolving version
├── PC-Agent/ # PC platform version
└── requirements.txt # Dependencies
Model Release
- GUI-Owl-7B and GUI-Owl-32B model checkpoints have been released.
- Supports deployment on HuggingFace and ModelScope platforms.
- Online demo experience is provided.
Community and Ecosystem
Online Demos
Online demos are available on Hugging Face and ModelScope.
Related Projects
- AppAgent: Multimodal Agents as Smartphone Users
- mPLUG-Owl: Modular Multimodal Large Language Model
- Qwen-VL: General Vision-Language Model
- GroundingDINO: Open-Set Object Detection
Future Development
This project represents the cutting edge of GUI automation agent development. Through continued technical innovation and performance optimization, it moves toward the goal of general-purpose AI assistants. As model capabilities improve and application scenarios expand, Mobile-Agent is expected to play a significant role in more practical settings.