Real-world centric foundation GUI agents with native user interaction, MCP tool integration, and device-cloud collaboration capabilities
MAI-UI: Real-World Centric Foundation GUI Agents
Overview
MAI-UI is a family of foundation GUI agents developed by Alibaba's Tongyi Lab, spanning model sizes from 2B to 235B-A22B parameters. The project focuses on making GUI agents practical for real-world deployment through native user interaction, MCP tool integration, and a device-cloud deployment architecture.
Key Features & Innovations
1. Multi-Scale Foundation Models
- Model Variants: 2B, 8B, 32B, and 235B-A22B parameters
- Base Architecture: Built on Qwen3-VL multimodal large language models
- Training Approach: Joint supervised fine-tuning and reinforcement learning
- Deployment Flexibility: Suitable for various hardware constraints and performance requirements
2. Extended Action Space
MAI-UI introduces three critical capabilities beyond traditional GUI operations:
Agent-User Interaction
- ask_user action: Proactively requests clarification for ambiguous instructions
- Dynamic conversation: Handles incomplete or unclear user requirements
- Real-world applicability: Addresses the common scenario where user instructions lack specificity
MCP Tool Integration
- mcp_call action: Direct invocation of external tools through the Model Context Protocol (see the sketch after this list)
- API-level operations: Efficient alternatives to complex UI manipulations
- Enhanced functionality: Access to services like mapping, file management, and data retrieval
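As a concrete illustration, the two extended actions can be thought of as structured outputs emitted alongside ordinary GUI actions. The sketch below is illustrative only: the field names and action schema are assumptions made for this example, not the official MAI-UI action format.

```python
# Minimal sketch of the extended actions as structured outputs.
# Field names ("action", "question", "server", "tool", "arguments") are
# illustrative assumptions, not MAI-UI's official schema.

# The agent asks the user to resolve an ambiguous instruction.
ask_user_action = {
    "action": "ask_user",
    "question": "Which restaurant and what time should I book?",
}

# The agent calls an external tool through MCP instead of driving the UI.
mcp_call_action = {
    "action": "mcp_call",
    "server": "maps",                 # hypothetical MCP server
    "tool": "search_route",           # hypothetical tool on that server
    "arguments": {"origin": "home", "destination": "airport", "mode": "transit"},
}

def dispatch(action: dict) -> None:
    """Route a structured action to the appropriate handler (sketch)."""
    if action["action"] == "ask_user":
        print(f"[to user] {action['question']}")
    elif action["action"] == "mcp_call":
        print(f"[mcp] {action['server']}.{action['tool']}({action['arguments']})")
    else:
        print(f"[gui] {action}")

dispatch(ask_user_action)
dispatch(mcp_call_action)
```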
Device-Cloud Collaboration
- Intelligent routing: Dynamic selection between on-device and cloud execution
- Privacy preservation: Keeps sensitive operations local while leveraging cloud for complex tasks
- Cost optimization: Reduces cloud API calls by over 40%
3. Self-Evolving Data Pipeline
- Autonomous data generation: Continuous improvement of training corpus
- Multi-agent collaboration: Combination of human annotations and model-generated trajectories
- Quality filtering: Judge models evaluate and retain high-quality execution paths
- Dynamic adaptation: Training data evolves with model capabilities
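The judge-based filtering step can be pictured as a score-and-threshold loop over candidate trajectories. The sketch below is a minimal illustration: `judge_score` stands in for a judge model, and the trajectory format is assumed for this example.

```python
# Sketch of judge-based trajectory filtering. judge_score() stands in for a
# judge model; the trajectory format is an illustrative assumption.
from typing import Callable

Trajectory = dict  # e.g. {"instruction": str, "steps": list, "success": bool}

def filter_trajectories(
    candidates: list[Trajectory],
    judge_score: Callable[[Trajectory], float],
    threshold: float = 0.8,
) -> list[Trajectory]:
    """Keep only execution paths the judge rates above the threshold."""
    return [traj for traj in candidates if judge_score(traj) >= threshold]

# Usage with a stand-in judge that only keeps self-reported successes.
demo = [
    {"instruction": "open settings", "steps": ["click(540, 960)"], "success": True},
    {"instruction": "send email", "steps": ["click(100, 200)"], "success": False},
]
print(filter_trajectories(demo, judge_score=lambda t: 1.0 if t["success"] else 0.0))
```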
4. Large-Scale Online Reinforcement Learning
- Massive parallelization: Up to 512 parallel Android environments
- Extended step budget: Support for up to 50 environment steps
- Significant improvements: +5.2 points from environment scaling, +4.3 points from step budget increase
- Real-world robustness: Training in dynamic environments with pop-ups, ads, and UI changes
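The two scaling knobs quoted above, the number of parallel environments and the per-episode step budget, can be sketched as a rollout collector. Everything in the sketch is a stand-in: `DummyEnv` and the no-op policy are hypothetical and only illustrate the control flow, not MAI-UI's training harness.

```python
# Sketch of parallel rollout collection with a per-episode step budget.
from concurrent.futures import ThreadPoolExecutor

MAX_STEPS = 50   # per-episode step budget (matches the figure quoted above)
NUM_ENVS = 8     # scaled toward 512 parallel Android environments in the real setup

class DummyEnv:
    """Stand-in environment so the sketch runs end to end."""
    def reset(self, instruction):
        self.t = 0
        return "screen_0"                              # placeholder screenshot
    def step(self, action):
        self.t += 1
        return f"screen_{self.t}", 0.0, self.t >= 3    # (obs, reward, done)

def run_episode(env, policy, instruction):
    """Roll out one episode, stopping at completion or the step budget."""
    obs, trajectory = env.reset(instruction), []
    for _ in range(MAX_STEPS):
        action = policy(obs, instruction)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

def collect_rollouts(envs, policy, instructions):
    """Run one episode per environment in parallel and gather trajectories."""
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        futures = [pool.submit(run_episode, env, policy, task)
                   for env, task in zip(envs, instructions)]
        return [f.result() for f in futures]

rollouts = collect_rollouts(
    [DummyEnv() for _ in range(NUM_ENVS)],
    policy=lambda obs, task: "noop",
    instructions=[f"task {i}" for i in range(NUM_ENVS)],
)
print(len(rollouts), "trajectories collected")
```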
Performance Achievements
GUI Grounding Benchmarks
- ScreenSpot-Pro: 73.5% accuracy (surpasses Gemini-3-Pro and Seed1.8)
- MMBench GUI L2: 91.3% accuracy
- OSWorld-G: 70.9% accuracy
- UI-Vision: 49.2% accuracy
Mobile Navigation Benchmarks
- AndroidWorld: 76.7% success rate (new SOTA, surpassing UI-TARS-2, Gemini-2.5-Pro, and Seed1.8)
- MobileWorld: 41.7% success rate (20.8 point improvement over strongest baselines)
Device-Cloud Collaboration Results
- Performance: 33% improvement in on-device task performance
- Cost reduction: Over 40% reduction in cloud model calls
- Privacy preservation: 40.5% of tasks completed entirely on-device
Technical Architecture
Model Foundation
- Backbone: Qwen3-VL multimodal architecture
- Input modalities: Natural language instructions and rendered UI screenshots
- Output: Structured actions for live Android devices
- Action space: Click, swipe, text input, system buttons, plus enhanced interaction capabilities
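To make "structured actions for live Android devices" concrete, the sketch below maps a few basic actions onto standard adb input commands. The action dictionary format is an assumption made for this example; only the adb commands themselves are standard Android tooling, and running the example requires adb and a connected device.

```python
# Sketch: executing basic structured actions on an Android device via adb.
# The action dict format is an illustrative assumption, not MAI-UI's official schema.
import subprocess

def adb(*args: str) -> None:
    subprocess.run(["adb", "shell", *args], check=True)

def execute(action: dict) -> None:
    kind = action["type"]
    if kind == "click":
        adb("input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "swipe":
        adb("input", "swipe",
            str(action["x1"]), str(action["y1"]),
            str(action["x2"]), str(action["y2"]), "300")           # 300 ms swipe
    elif kind == "type":
        adb("input", "text", action["text"].replace(" ", "%s"))    # adb encodes spaces as %s
    elif kind == "system_button":
        keycodes = {"back": "KEYCODE_BACK", "home": "KEYCODE_HOME"}
        adb("input", "keyevent", keycodes[action["button"]])

if __name__ == "__main__":
    # Example: tap the middle of a 1080x1920 screen (requires a connected device).
    execute({"type": "click", "x": 540, "y": 960})
```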
Training Methodology
- Supervised Fine-tuning: Initial training on curated GUI grounding and navigation data
- Online Reinforcement Learning: Continuous improvement through interaction with live environments
- Self-evolving pipeline: Autonomous data generation and quality improvement
- Multi-dimensional integration: User interactions, MCP tool calls, and traditional GUI operations
Deployment System
- Hybrid architecture: Seamless integration of on-device and cloud models
- Task-aware routing: Intelligent decision-making based on task complexity and privacy requirements
- Privacy-first design: Sensitive operations remain local while complex tasks leverage cloud capabilities
- Cost optimization: Efficient resource utilization through intelligent workload distribution
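As a rough illustration of task-aware routing, the sketch below decides between the on-device and cloud models using simple privacy and complexity heuristics. The keyword list, complexity proxy, and threshold are assumptions made for this example and are much cruder than the routing described in the report.

```python
# Sketch of a task-aware device/cloud router. Keywords, the complexity proxy,
# and the threshold are illustrative assumptions, not MAI-UI's routing policy.

SENSITIVE_KEYWORDS = {"password", "bank", "payment", "contacts", "photo"}

def is_privacy_sensitive(instruction: str) -> bool:
    return any(word in instruction.lower() for word in SENSITIVE_KEYWORDS)

def estimate_complexity(instruction: str) -> int:
    """Crude proxy: count sub-steps implied by the instruction."""
    return 1 + sum(instruction.lower().count(sep) for sep in (" then ", " and ", ","))

def route(instruction: str) -> str:
    """Return 'device' or 'cloud' for a given task."""
    if is_privacy_sensitive(instruction):
        return "device"    # keep sensitive operations local
    if estimate_complexity(instruction) >= 3:
        return "cloud"     # long multi-app workflows go to the larger cloud model
    return "device"        # default on-device to reduce cloud calls

print(route("Check my bank balance"))                      # -> device
print(route("Find a cafe, book a table, then share it"))   # -> cloud
```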
Real-World Applications
Home & Personal Use
- Smart shopping: Proactive suggestions based on calendar integration
- Task automation: Complex multi-app workflows for daily activities
- Contextual assistance: Understanding user intent through natural conversation
Professional & Office Use
- Document management: Intelligent file handling and sharing
- Communication assistance: Email composition with context awareness
- Cross-app integration: Seamless workflows across multiple applications
Navigation & Location Services
- Route planning: Integration with mapping services through MCP tools
- Location-aware suggestions: Context-sensitive recommendations
- Multi-modal transportation: Support for various transportation methods
Technical Specifications
Requirements
- vLLM: Version ≥0.11.0
- Transformers: Version ≥4.57.0
- Python: Compatible with standard ML ecosystem
- Hardware: Scalable from mobile devices to cloud infrastructure
Available Models
- MAI-UI-2B: Lightweight model for resource-constrained environments
- MAI-UI-8B: Balanced performance and efficiency
- Larger variants: 32B and 235B-A22B for maximum capability
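For direct use, the released 2B and 8B checkpoints should be loadable with the standard transformers vision-language classes. The sketch below assumes a repo id of `Tongyi-MAI/MAI-UI-2B` and the generic `AutoModelForImageTextToText`/`AutoProcessor` interface; consult the model card for the exact prompt format and recommended usage.

```python
# Sketch of loading a MAI-UI checkpoint with transformers (>=4.57.0).
# The repo id, auto classes, and message format are assumptions; follow the
# model card for the exact usage.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Tongyi-MAI/MAI-UI-2B"   # assumed repo id under the Tongyi-MAI org

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},   # current UI screenshot
        {"type": "text", "text": "Open the Wi-Fi settings."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```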
Integration Options
- API service: OpenAI-compatible interface through vLLM
- Direct integration: Python SDK for custom applications
- Container deployment: Docker support for scalable deployment
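For the API-service route, a typical pattern is to serve a checkpoint with vLLM (≥0.11.0) and query it through any OpenAI-compatible client. The serving command, repo id, and prompt below are illustrative assumptions; see the repository documentation for the recommended serving flags and prompt format.

```python
# Sketch: querying a MAI-UI model served via vLLM's OpenAI-compatible API.
# Assumes something like `vllm serve Tongyi-MAI/MAI-UI-8B --port 8000` is running;
# the repo id and prompt are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Tongyi-MAI/MAI-UI-8B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text", "text": "Turn on airplane mode."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)   # expected to contain the next action
```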
Research Impact
Benchmark Leadership
MAI-UI establishes new state-of-the-art performance across multiple authoritative benchmarks, demonstrating both theoretical advancement and practical applicability.
Methodological Contributions
- Device-cloud collaboration: Novel deployment architecture for GUI agents
- Self-evolving data: Autonomous improvement of training datasets
- Extended interaction model: Native support for user dialogue and tool integration
Industry Applications
The project addresses real-world deployment challenges that have historically limited GUI agent adoption, making it suitable for production environments.
Open Source Commitment
Licensing
- Apache License 2.0: Permissive licensing for commercial and research use
- Third-party components: Clearly documented with appropriate attributions
- Community contribution: Open development model encouraging collaboration
Available Resources
- Models: MAI-UI-2B and MAI-UI-8B on Hugging Face
- Code: Complete implementation on GitHub
- Documentation: Comprehensive technical reports and usage guides
- Benchmarks: MobileWorld benchmark for evaluation
Future Directions
Research Extensions
- Larger model variants: Continued development of the 32B and 235B-A22B models
- Cross-platform support: Extension beyond Android to iOS and desktop platforms
- Enhanced tool integration: Broader MCP tool ecosystem
Commercial Applications
- Enterprise deployment: Integration with business workflows
- Accessibility solutions: Assistance for users with disabilities
- Productivity enhancement: Advanced automation for knowledge workers
Citation Information
@misc{zhou2025maiuitechnicalreportrealworld,
title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
year={2025},
eprint={2512.22047},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.22047}
}
Contact Information
- Project Lead: Hanzhang Zhou (hanzhang.zhou@alibaba-inc.com)
- Technical Lead: Xu Zhang (hanguang.zx@alibaba-inc.com)
- Research Director: Yue Wang (yue.w@alibaba-inc.com)
- Institution: Tongyi Lab, Alibaba Group
Additional Resources
- Project Website: https://tongyi-mai.github.io/MAI-UI/
- GitHub Repository: https://github.com/Tongyi-MAI/MAI-UI
- Hugging Face Models: https://huggingface.co/Tongyi-MAI
- Technical Paper: https://arxiv.org/abs/2512.22047
- MobileWorld Benchmark: https://github.com/Tongyi-MAI/MobileWorld