Real-world centric foundation GUI agents with native user interaction, MCP tool integration, and device-cloud collaboration capabilities

Tongyi-MAI/MAI-UI · Apache-2.0 · Jupyter Notebook · 1.6k stars · Last Updated: January 15, 2026

MAI-UI: Real-World Centric Foundation GUI Agents

Overview

MAI-UI is a family of foundation GUI agents developed by Alibaba's Tongyi Lab, spanning model sizes from 2B to 235B-A22B parameters. The project targets practical real-world deployment of GUI agents through native user interaction, MCP tool integration, and a device-cloud collaborative deployment architecture.

Key Features & Innovations

1. Multi-Scale Foundation Models

  • Model Variants: 2B, 8B, 32B, and 235B-A22B parameters
  • Base Architecture: Built on Qwen3-VL multimodal large language models
  • Training Approach: Joint supervised fine-tuning and reinforcement learning
  • Deployment Flexibility: Suitable for various hardware constraints and performance requirements

2. Extended Action Space

MAI-UI introduces three critical capabilities beyond traditional GUI operations:

Agent-User Interaction

  • ask_user action: Proactively requests clarification for ambiguous instructions
  • Dynamic conversation: Handles incomplete or unclear user requirements
  • Real-world applicability: Addresses the common scenario where user instructions lack specificity
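The exact action schema is not reproduced in this summary. As a minimal sketch, assuming the model emits structured actions as JSON-like dictionaries, an ask_user turn might look like the following (field names are illustrative, not the official format):

```python
# Hypothetical example of an agent turn that pauses to ask the user for
# clarification instead of guessing. Field names are illustrative only.
ambiguous_instruction = "Book me a table for dinner tonight"

agent_action = {
    "action": "ask_user",
    "question": "Which restaurant and what time would you like the reservation for?",
    "reason": "The instruction does not specify a restaurant, time, or party size.",
}

# The user's reply is appended to the dialogue history and the agent continues.
user_reply = "Din Tai Fung at 7 pm, two people"
```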

MCP Tool Integration

  • mcp_call action: Direct invocation of external tools through Model Context Protocol
  • API-level operations: Efficient alternatives to complex UI manipulations
  • Enhanced functionality: Access to services like mapping, file management, and data retrieval
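As with ask_user, the mcp_call format below is an illustrative sketch rather than the documented schema; it shows how a route query might be delegated to an MCP mapping tool instead of being driven through the map app's UI (tool and parameter names are assumptions):

```python
# Hypothetical mcp_call action: instead of opening the maps app and tapping
# through its UI, the agent invokes a mapping tool exposed via MCP.
agent_action = {
    "action": "mcp_call",
    "server": "maps",
    "tool": "plan_route",
    "arguments": {
        "origin": "current_location",
        "destination": "Hangzhou East Railway Station",
        "mode": "transit",
    },
}

# The MCP server's structured response is fed back to the agent as an observation.
```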

Device-Cloud Collaboration

  • Intelligent routing: Dynamic selection between on-device and cloud execution
  • Privacy preservation: Keeps sensitive operations local while leveraging cloud for complex tasks
  • Cost optimization: Reduces cloud API calls by over 40%
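The routing policy itself is not published in detail here; the sketch below only illustrates the idea of task-aware routing, assuming a privacy flag and a rough complexity estimate per task (all names and thresholds are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Task:
    instruction: str
    involves_sensitive_data: bool  # e.g. passwords, payments, personal messages
    estimated_steps: int           # rough complexity estimate

def route(task: Task, complexity_threshold: int = 10) -> str:
    """Decide whether to run the task on-device or in the cloud.

    Hypothetical policy: privacy-sensitive tasks always stay on-device;
    otherwise, long or complex tasks are escalated to the cloud model.
    """
    if task.involves_sensitive_data:
        return "on_device"   # keep sensitive operations local
    if task.estimated_steps > complexity_threshold:
        return "cloud"       # escalate complex tasks to the larger model
    return "on_device"       # default to the cheaper local model

print(route(Task("Reply to my bank's SMS", True, 4)))          # -> on_device
print(route(Task("Plan a 3-city trip itinerary", False, 25)))  # -> cloud
```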

3. Self-Evolving Data Pipeline

  • Autonomous data generation: Continuous improvement of training corpus
  • Multi-agent collaboration: Combination of human annotations and model-generated trajectories
  • Quality filtering: Judge models evaluate and retain high-quality execution paths
  • Dynamic adaptation: Training data evolves with model capabilities
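A simplified view of the quality-filtering step, assuming a judge model that scores each candidate trajectory (the scoring interface and threshold are assumptions for illustration):

```python
# Hypothetical sketch of judge-based trajectory filtering in a self-evolving
# data pipeline: keep only execution paths the judge rates highly.
from typing import Callable

def filter_trajectories(
    trajectories: list[dict],
    judge: Callable[[dict], float],
    min_score: float = 0.8,
) -> list[dict]:
    """Retain trajectories whose judge score meets the quality bar."""
    kept = []
    for traj in trajectories:
        score = judge(traj)  # e.g. checks goal completion and step validity
        if score >= min_score:
            kept.append({**traj, "judge_score": score})
    return kept

# Usage with a toy judge that trusts trajectories marked as successful.
toy_judge = lambda t: 1.0 if t.get("task_completed") else 0.0
data = [{"task": "open settings", "task_completed": True},
        {"task": "send email", "task_completed": False}]
print(filter_trajectories(data, toy_judge))  # keeps only the first trajectory
```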

4. Large-Scale Online Reinforcement Learning

  • Massive parallelization: Up to 512 parallel Android environments
  • Extended step budget: Support for up to 50 environment steps per rollout
  • Significant improvements: +5.2 points from environment scaling, +4.3 points from step budget increase
  • Real-world robustness: Training in dynamic environments with pop-ups, ads, and UI changes
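The RL infrastructure is not open in this summary; the loop below is only a schematic of rollout collection across many parallel environments with a bounded step budget (the environment class is a stand-in, not the project's actual interface):

```python
import random

class FakeAndroidEnv:
    """Stand-in for a real Android environment; not the project's actual API."""
    def reset(self):
        return {"screenshot": b"", "done": False}
    def step(self, action):
        # A real environment would execute the action on a device or emulator.
        return {"screenshot": b"", "done": random.random() < 0.05, "reward": 0.0}

def collect_rollouts(num_envs: int = 512, max_steps: int = 50):
    """Schematic rollout collection: many parallel envs, bounded step budget."""
    envs = [FakeAndroidEnv() for _ in range(num_envs)]
    observations = [env.reset() for env in envs]
    trajectories = [[] for _ in range(num_envs)]
    for _ in range(max_steps):
        for i, env in enumerate(envs):
            if observations[i]["done"]:
                continue  # this episode already finished
            action = {"action": "click", "x": 0, "y": 0}  # policy call goes here
            observations[i] = env.step(action)
            trajectories[i].append((action, observations[i]))
    return trajectories

rollouts = collect_rollouts(num_envs=8, max_steps=10)  # small toy run
print(len(rollouts), "trajectories collected")
```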

Performance Achievements

GUI Grounding Benchmarks

  • ScreenSpot-Pro: 73.5% accuracy (surpasses Gemini-3-Pro and Seed1.8)
  • MMBench-GUI L2: 91.3% accuracy
  • OSWorld-G: 70.9% accuracy
  • UI-Vision: 49.2% accuracy

Mobile Navigation Benchmarks

  • AndroidWorld: 76.7% success rate (new SOTA, surpassing UI-TARS-2, Gemini-2.5-Pro, and Seed1.8)
  • MobileWorld: 41.7% success rate (20.8 point improvement over strongest baselines)

Device-Cloud Collaboration Results

  • Performance improvement: 33% enhancement in on-device performance
  • Cost reduction: Over 40% reduction in cloud model calls
  • Privacy preservation: 40.5% of tasks completed entirely on-device

Technical Architecture

Model Foundation

  • Backbone: Qwen3-VL multimodal architecture
  • Input modalities: Natural language instructions and rendered UI screenshots
  • Output: Structured actions for live Android devices
  • Action space: Click, swipe, text input, system buttons, plus enhanced interaction capabilities
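The snippet below sketches one way this extended action space could be represented; the action names follow the description above, while the enum itself is only illustrative and not the project's official type definitions:

```python
from enum import Enum

class MAIUIAction(Enum):
    """Illustrative enumeration of the action space described above."""
    CLICK = "click"                  # tap a screen coordinate or element
    SWIPE = "swipe"                  # scroll or drag gesture
    INPUT_TEXT = "input_text"        # type into a focused field
    SYSTEM_BUTTON = "system_button"  # back / home / recents
    ASK_USER = "ask_user"            # request clarification from the user
    MCP_CALL = "mcp_call"            # invoke an external tool via MCP

print([a.value for a in MAIUIAction])
```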

Training Methodology

  1. Supervised Fine-tuning: Initial training on curated GUI grounding and navigation data
  2. Online Reinforcement Learning: Continuous improvement through interaction with live environments
  3. Self-evolving pipeline: Autonomous data generation and quality improvement
  4. Multi-dimensional integration: User interactions, MCP tool calls, and traditional GUI operations

Deployment System

  • Hybrid architecture: Seamless integration of on-device and cloud models
  • Task-aware routing: Intelligent decision-making based on task complexity and privacy requirements
  • Privacy-first design: Sensitive operations remain local while complex tasks leverage cloud capabilities
  • Cost optimization: Efficient resource utilization through intelligent workload distribution

Real-World Applications

Home & Personal Use

  • Smart shopping: Proactive suggestions based on calendar integration
  • Task automation: Complex multi-app workflows for daily activities
  • Contextual assistance: Understanding user intent through natural conversation

Professional & Office Use

  • Document management: Intelligent file handling and sharing
  • Communication assistance: Email composition with context awareness
  • Cross-app integration: Seamless workflows across multiple applications

Navigation & Location Services

  • Route planning: Integration with mapping services through MCP tools
  • Location-aware suggestions: Context-sensitive recommendations
  • Multi-modal transportation: Route planning across different transportation modes

Technical Specifications

Requirements

  • vLLM: Version ≥0.11.0
  • Transformers: Version ≥4.57.0
  • Python: Compatible with standard ML ecosystem
  • Hardware: Scalable from mobile devices to cloud infrastructure
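A quick way to confirm that an environment meets the stated minimum versions (a small helper for convenience, not part of MAI-UI itself):

```python
# Check installed vLLM and Transformers against the stated minimum versions.
from importlib.metadata import version, PackageNotFoundError
from packaging.version import Version  # ships with most ML environments

requirements = {"vllm": "0.11.0", "transformers": "4.57.0"}

for package, minimum in requirements.items():
    try:
        installed = version(package)
        ok = Version(installed) >= Version(minimum)
        print(f"{package} {installed}: {'OK' if ok else f'needs >= {minimum}'}")
    except PackageNotFoundError:
        print(f"{package}: not installed (required >= {minimum})")
```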

Available Models

  • MAI-UI-2B: Lightweight model for resource-constrained environments
  • MAI-UI-8B: Balanced performance and efficiency
  • Larger variants: 32B and 235B-A22B for maximum capability

Integration Options

  • API service: OpenAI-compatible interface through vLLM
  • Direct integration: Python SDK for custom applications
  • Container deployment: Docker support for scalable deployment
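Assuming a MAI-UI checkpoint is served through vLLM's OpenAI-compatible server, a client call might look like the following; the served model name, endpoint, and prompt format are placeholders, and the exact prompt template expected by MAI-UI should be taken from the repository's documentation:

```python
# Hypothetical client call against a vLLM OpenAI-compatible endpoint serving
# a MAI-UI checkpoint. Model name, URL, and prompt wording are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="MAI-UI-8B",  # placeholder: use the name the server was launched with
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text",
             "text": "Open the Wi-Fi settings and connect to 'HomeNetwork'."},
        ],
    }],
)
print(response.choices[0].message.content)
```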

Research Impact

Benchmark Leadership

MAI-UI establishes new state-of-the-art performance across multiple authoritative benchmarks, demonstrating both theoretical advancement and practical applicability.

Methodological Contributions

  • Device-cloud collaboration: Novel deployment architecture for GUI agents
  • Self-evolving data: Autonomous improvement of training datasets
  • Extended interaction model: Native support for user dialogue and tool integration

Industry Applications

The project addresses real-world deployment challenges that have historically limited GUI agent adoption, making it suitable for production environments.

Open Source Commitment

Licensing

  • Apache License 2.0: Permissive licensing for commercial and research use
  • Third-party components: Clearly documented with appropriate attributions
  • Community contribution: Open development model encouraging collaboration

Available Resources

  • Models: MAI-UI-2B and MAI-UI-8B on Hugging Face
  • Code: Complete implementation on GitHub
  • Documentation: Comprehensive technical reports and usage guides
  • Benchmarks: MobileWorld benchmark for evaluation

Future Directions

Research Extensions

  • Larger model variants: Continued development of 32B and 235B models
  • Cross-platform support: Extension beyond Android to iOS and desktop platforms
  • Enhanced tool integration: Broader MCP tool ecosystem

Commercial Applications

  • Enterprise deployment: Integration with business workflows
  • Accessibility solutions: Assistance for users with disabilities
  • Productivity enhancement: Advanced automation for knowledge workers

Citation Information

@misc{zhou2025maiuitechnicalreportrealworld,
  title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
  author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
  year={2025},
  eprint={2512.22047},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.22047}
}
