A powerful family of multimodal GUI automation agents supporting end-to-end operations on mobile devices and PC platforms.
Detailed Introduction to the Mobile-Agent Project
Project Overview
Mobile-Agent is a family of GUI agents developed by the Alibaba X-PLUG team: an end-to-end multimodal agent system designed for mobile devices and PC platforms. The project aims to achieve GUI automation by autonomously operating various applications through visual perception, reasoning and planning, and action execution.
Project Architecture and Components
Core Component Series
1. GUI-Owl Foundation Model
GUI-Owl is a foundational GUI agent model that has achieved state-of-the-art performance among open-source end-to-end models across ten GUI benchmarks, covering localization, Q&A, planning, decision-making, and procedural knowledge in both desktop and mobile environments. GUI-Owl-7B scored 66.4 on AndroidWorld and 29.4 on OSWorld.
2. Mobile-Agent-v3
Mobile-Agent-v3 is a cross-platform multi-agent framework built on GUI-Owl, providing planning, progress management, reflection, and memory capabilities. The underlying GUI-Owl model is a native end-to-end multimodal agent designed as a foundation model for GUI automation, unifying perception, localization, reasoning, planning, and action execution within a single policy network.
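To make the role division concrete, here is a minimal Python sketch of such a plan-act-reflect loop with shared memory. The class names (Planner, Executor, Reflector, Memory) and the string-based stubs are illustrative assumptions, not the project's actual agents, which drive a multimodal model over real screenshots and device actions.
from dataclasses import dataclass, field

@dataclass
class Memory:
    notes: list = field(default_factory=list)  # progress notes shared across steps

class Planner:
    def next_subgoal(self, instruction, memory):
        # Decide the next sub-goal from the instruction and accumulated notes.
        return f"work toward: {instruction}"

class Executor:
    def act(self, subgoal, screenshot):
        # Ground the sub-goal in the current screenshot and emit a single GUI action.
        return {"type": "click", "target": subgoal}

class Reflector:
    def judge(self, subgoal, action, screenshot_after):
        # Check whether the action made progress; "done" ends the episode.
        return "progress"

def run_task(instruction, get_screenshot, execute_action, max_steps=10):
    memory, planner, executor, reflector = Memory(), Planner(), Executor(), Reflector()
    for _ in range(max_steps):
        before = get_screenshot()
        subgoal = planner.next_subgoal(instruction, memory)
        action = executor.act(subgoal, before)
        execute_action(action)
        verdict = reflector.judge(subgoal, action, get_screenshot())
        memory.notes.append((subgoal, action, verdict))
        if verdict == "done":
            break
    return memory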
3. Mobile-Agent-E
Mobile-Agent-E is a self-evolving hierarchical multi-agent framework capable of self-evolution through past experiences, demonstrating stronger performance on complex multi-application tasks.
4. PC-Agent
PC-Agent is a multi-agent collaboration system that can automate productivity scenarios (e.g., Chrome, Word, WeChat) based on user instructions. Its active perception module, designed for dense and diverse interactive elements, is better adapted to the PC platform. The hierarchical multi-agent collaboration structure improves the success rate for more complex task sequences. It now supports both Windows and Mac.
5. Mobile-Agent-v2
Mobile-Agent-v2 is a mobile device operation assistant that achieves effective navigation via multi-agent collaboration. Its multi-agent architecture addresses navigation challenges in long-context input scenarios. An enhanced visual perception module significantly improves operation accuracy.
Technical Features
Core Technical Advantages
- Cross-Platform Compatibility: Supports multiple platforms including Android, iOS, Windows, and Mac.
- Visual Perception Capability: Utilizes visual perception tools to accurately identify and locate visual and text elements in application front-end interfaces.
- Multimodal Understanding: Combines visual and language understanding for complex task reasoning.
- End-to-End Operation: A complete automation process from task understanding to execution.
- Self-Evolution: Continuously improves performance through experiential learning.
Technical Innovations
Three Innovations of GUI-Owl
- Large-Scale Environment Infrastructure: Cloud-based virtual environments covering Android, Ubuntu, macOS, and Windows, supporting a self-evolving GUI trajectory production framework.
- Diverse Foundation Agent Capabilities: Integrates UI localization, planning, action semantics, and reasoning patterns, supporting end-to-end decision-making.
- Scalable Environment Reinforcement Learning: Developed a scalable reinforcement learning framework with fully asynchronous training for real-world alignment.
Performance
Benchmark Achievements
- Mobile-Agent-v3 achieved 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art standard for open-source GUI agent frameworks.
- Achieved SOTA performance on multiple GUI automation evaluation leaderboards, including ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, MMBench-GUI, AndroidControl, AndroidWorld, and OSWorld.
System Performance Optimization
- Low memory overhead (8GB)
- Fast inference speed (10-15 seconds per operation)
- All models are open-source
Technical Implementation
Environment Requirements
# Basic environment setup
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt
Android Platform Configuration
- Download Android Debug Bridge (ADB).
- Enable ADB debugging on your Android phone.
- Connect your phone to the computer with a data cable and select "File transfer".
- Test ADB environment:
/path/to/adb devices
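As a quick programmatic check, the snippet below runs the same command and lists attached devices; it is a minimal sketch, and the adb path placeholder and use of Python's subprocess module are illustrative rather than part of the project's code. adb prints a header line followed by one "<serial>\tdevice" line per connected device.
import subprocess

ADB = "/path/to/adb"  # replace with the actual path to your adb binary

def connected_devices():
    # Parse "adb devices": skip the header, keep serials whose state is "device".
    out = subprocess.run([ADB, "devices"], capture_output=True, text=True, check=True).stdout
    return [line.split("\t")[0] for line in out.splitlines()[1:] if line.endswith("\tdevice")]

print(connected_devices())  # expect at least one serial if ADB debugging is enabled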
PC Platform Configuration
# Windows environment
pip install -r requirements.txt
# Mac environment
pip install -r requirements_mac.txt
API Configuration
{
  "vl_model_name": "gpt-4o",
  "llm_model_name": "gpt-4o",
  "token": "sk-...",
  "url": "https://api.openai.com/v1"
}
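The config points the agent at an OpenAI-compatible endpoint. Below is a rough sketch of how such a config is typically consumed; the file name config.json and the use of the openai Python client are assumptions for illustration, not the project's exact loading code.
import json
from openai import OpenAI  # pip install openai

with open("config.json") as f:  # hypothetical file name for the JSON shown above
    cfg = json.load(f)

# Point the client at the configured OpenAI-compatible endpoint.
client = OpenAI(api_key=cfg["token"], base_url=cfg["url"])

resp = client.chat.completions.create(
    model=cfg["vl_model_name"],
    messages=[{"role": "user", "content": "Describe the current screen."}],
)
print(resp.choices[0].message.content)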
Application Scenarios
Supported Operation Types
- Mobile Application Operations: Click, swipe, text input, app switching (see the ADB-based sketch after this list).
- PC Application Operations: Browser control, office software operations, communication software usage.
- Cross-Application Tasks: Complex workflows across multiple applications.
- Complex Reasoning Tasks: Long-term tasks requiring multi-step reasoning.
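On Android, the basic mobile operations above are commonly realized through ADB input commands. The following is a minimal sketch under that assumption, not the project's actual action layer; coordinates are in screen pixels, and the adb path placeholder must be replaced.
import subprocess

ADB = "/path/to/adb"  # replace with the actual path to your adb binary

def adb_shell(*args):
    subprocess.run([ADB, "shell", *args], check=True)

def tap(x, y):
    adb_shell("input", "tap", str(x), str(y))

def swipe(x1, y1, x2, y2, duration_ms=300):
    adb_shell("input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

def type_text(text):
    # "input text" does not accept literal spaces; ADB expects %s in their place.
    adb_shell("input", "text", text.replace(" ", "%s"))

tap(540, 1200)               # e.g. open an app icon
swipe(540, 1600, 540, 400)   # e.g. scroll down a list
type_text("running shoes")   # e.g. fill a search box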
Practical Application Examples
- Online Shopping: Searching for products, comparing prices, adding to cart.
- Information Retrieval: Searching for news, getting sports match results.
- Office Automation: Writing documents, sending emails, data processing.
- Social Media: Posting content, replying to messages, sharing information.
Academic Achievements
Published Papers
- Mobile-Agent-v3 (2025): Foundational Agents for GUI Automation
- PC-Agent (ICLR 2025 Workshop): A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
- Mobile-Agent-E (2025): Self-Evolving Mobile Assistant for Complex Tasks
- Mobile-Agent-v2 (NeurIPS 2024): Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
- Mobile-Agent (ICLR 2024 Workshop): Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Awards
- Best Demo Award at the 24th China National Conference on Computational Linguistics (CCL 2025)
- Best Demo Award at the 23rd China National Conference on Computational Linguistics (CCL 2024)
Evaluation Benchmarks
Mobile-Eval Benchmark
Mobile-Eval is a benchmark designed to evaluate the performance of mobile device agents, including 10 mainstream single-application scenarios and 1 multi-application scenario. Each scenario features three types of instructions.
Test Scenario Examples
- Shopping Task: Find a hat on Alibaba and add it to the shopping cart.
- Music Playback: Search for singer Jay Chou in Amazon Music.
- Information Query: Search for today's Lakers game results.
- Email Sending: Send an empty email to a specified address.
Technology Stack
Core Technologies
- Multimodal Large Language Models: GPT-4V, Qwen-VL, etc.
- Visual Perception: CLIP, GroundingDINO, etc.
- Reinforcement Learning: Trajectory-aware Relative Policy Optimization (TRPO)
- Multi-Agent Framework: Hierarchical collaboration architecture
Supported Platforms
- Mobile Platforms: Android, HarmonyOS (≤ version 4)
- Desktop Platforms: Windows, macOS, Ubuntu
- Browsers: Chrome and other mainstream browsers
- Office Software: Word, Excel, PowerPoint, etc.
Open-Source Information
Repository Structure
MobileAgent/
├── Mobile-Agent/ # Original version
├── Mobile-Agent-v2/ # Multi-agent collaboration version
├── Mobile-Agent-v3/ # Latest version based on GUI-Owl
├── Mobile-Agent-E/ # Self-evolving version
├── PC-Agent/ # PC platform version
└── requirements.txt # Dependencies
Model Release
- GUI-Owl-7B and GUI-Owl-32B model checkpoints have been released.
- Supports deployment on HuggingFace and ModelScope platforms.
- Online demo experience is provided.
Community and Ecosystem
Online Demos
Online demos are available on Hugging Face and ModelScope.
Related Projects
- AppAgent: Multimodal Agents as Smartphone Users
- mPLUG-Owl: Modular Multimodal Large Language Model
- Qwen-VL: General Vision-Language Model
- GroundingDINO: Open-Set Object Detection
Future Development
This project represents the cutting edge of GUI automation agent development. Through continued technical innovation and performance optimization, it moves toward the goal of general-purpose AI assistants. As model capabilities improve and application scenarios expand, Mobile-Agent is expected to play a significant role in more practical settings.