Real-world centric foundation GUI agents with native user interaction, MCP tool integration, and device-cloud collaboration capabilities
MAI-UI: Real-World Centric Foundation GUI Agents
Overview
MAI-UI is a family of foundation GUI agents developed by Alibaba's Tongyi Lab, spanning model sizes from 2B to 235B-A22B parameters. The project focuses on making GUI agents practical for real-world deployment through native user interaction, MCP tool integration, and a device-cloud deployment architecture.
Key Features & Innovations
1. Multi-Scale Foundation Models
- Model Variants: 2B, 8B, 32B, and 235B-A22B parameters
- Base Architecture: Built on Qwen3-VL multimodal large language models
- Training Approach: Joint supervised fine-tuning and reinforcement learning
- Deployment Flexibility: Suitable for various hardware constraints and performance requirements
2. Extended Action Space
MAI-UI introduces three critical capabilities beyond traditional GUI operations:
Agent-User Interaction
- ask_user action: Proactively requests clarification for ambiguous instructions
- Dynamic conversation: Handles incomplete or unclear user requirements
- Real-world applicability: Addresses the common scenario where user instructions lack specificity
MCP Tool Integration
- mcp_call action: Direct invocation of external tools through the Model Context Protocol (see the sketch after this list)
- API-level operations: Efficient alternatives to complex UI manipulations
- Enhanced functionality: Access to services like mapping, file management, and data retrieval
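As a concrete illustration, the two extended actions can be thought of as structured outputs emitted alongside ordinary GUI actions. The sketch below is illustrative only: the field names and action schema are assumptions made for this example, not the official MAI-UI action format.

```python
# Minimal sketch of the extended actions as structured outputs.
# Field names ("action", "question", "server", "tool", "arguments") are
# illustrative assumptions, not MAI-UI's official schema.

# The agent asks the user to resolve an ambiguous instruction.
ask_user_action = {
    "action": "ask_user",
    "question": "Which restaurant and what time should I book?",
}

# The agent calls an external tool through MCP instead of driving the UI.
mcp_call_action = {
    "action": "mcp_call",
    "server": "maps",                 # hypothetical MCP server
    "tool": "search_route",           # hypothetical tool on that server
    "arguments": {"origin": "home", "destination": "airport", "mode": "transit"},
}

def dispatch(action: dict) -> None:
    """Route a structured action to the appropriate handler (sketch)."""
    if action["action"] == "ask_user":
        print(f"[to user] {action['question']}")
    elif action["action"] == "mcp_call":
        print(f"[mcp] {action['server']}.{action['tool']}({action['arguments']})")
    else:
        print(f"[gui] {action}")

dispatch(ask_user_action)
dispatch(mcp_call_action)
```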
Device-Cloud Collaboration
- Intelligent routing: Dynamic selection between on-device and cloud execution
- Privacy preservation: Keeps sensitive operations local while leveraging cloud for complex tasks
- Cost optimization: Reduces cloud API calls by over 40%
3. Self-Evolving Data Pipeline
- Autonomous data generation: Continuous improvement of training corpus
- Multi-agent collaboration: Combination of human annotations and model-generated trajectories
- Quality filtering: Judge models evaluate and retain high-quality execution paths
- Dynamic adaptation: Training data evolves with model capabilities
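The judge-based filtering step can be pictured as a score-and-threshold loop over candidate trajectories. The sketch below is a minimal illustration: `judge_score` stands in for a judge model, and the trajectory format is assumed for this example.

```python
# Sketch of judge-based trajectory filtering. judge_score() stands in for a
# judge model; the trajectory format is an illustrative assumption.
from typing import Callable

Trajectory = dict  # e.g. {"instruction": str, "steps": list, "success": bool}

def filter_trajectories(
    candidates: list[Trajectory],
    judge_score: Callable[[Trajectory], float],
    threshold: float = 0.8,
) -> list[Trajectory]:
    """Keep only execution paths the judge rates above the threshold."""
    return [traj for traj in candidates if judge_score(traj) >= threshold]

# Usage with a stand-in judge that only keeps self-reported successes.
demo = [
    {"instruction": "open settings", "steps": ["click(540, 960)"], "success": True},
    {"instruction": "send email", "steps": ["click(100, 200)"], "success": False},
]
print(filter_trajectories(demo, judge_score=lambda t: 1.0 if t["success"] else 0.0))
```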
4. Large-Scale Online Reinforcement Learning
- Massive parallelization: Up to 512 parallel Android environments
- Extended step budget: Support for up to 50 environment steps
- Significant improvements: +5.2 points from environment scaling, +4.3 points from step budget increase
- Real-world robustness: Training in dynamic environments with pop-ups, ads, and UI changes
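The two scaling knobs quoted above, the number of parallel environments and the per-episode step budget, can be sketched as a rollout collector. Everything in the sketch is a stand-in: `DummyEnv` and the no-op policy are hypothetical and only illustrate the control flow, not MAI-UI's training harness.

```python
# Sketch of parallel rollout collection with a per-episode step budget.
from concurrent.futures import ThreadPoolExecutor

MAX_STEPS = 50   # per-episode step budget (matches the figure quoted above)
NUM_ENVS = 8     # scaled toward 512 parallel Android environments in the real setup

class DummyEnv:
    """Stand-in environment so the sketch runs end to end."""
    def reset(self, instruction):
        self.t = 0
        return "screen_0"                              # placeholder screenshot
    def step(self, action):
        self.t += 1
        return f"screen_{self.t}", 0.0, self.t >= 3    # (obs, reward, done)

def run_episode(env, policy, instruction):
    """Roll out one episode, stopping at completion or the step budget."""
    obs, trajectory = env.reset(instruction), []
    for _ in range(MAX_STEPS):
        action = policy(obs, instruction)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

def collect_rollouts(envs, policy, instructions):
    """Run one episode per environment in parallel and gather trajectories."""
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        futures = [pool.submit(run_episode, env, policy, task)
                   for env, task in zip(envs, instructions)]
        return [f.result() for f in futures]

rollouts = collect_rollouts(
    [DummyEnv() for _ in range(NUM_ENVS)],
    policy=lambda obs, task: "noop",
    instructions=[f"task {i}" for i in range(NUM_ENVS)],
)
print(len(rollouts), "trajectories collected")
```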
Performance Achievements
GUI Grounding Benchmarks
- ScreenSpot-Pro: 73.5% accuracy (surpasses Gemini-3-Pro and Seed1.8)
- MMBench GUI L2: 91.3% accuracy
- OSWorld-G: 70.9% accuracy
- UI-Vision: 49.2% accuracy
Mobile Navigation Benchmarks
- AndroidWorld: 76.7% success rate (new SOTA, surpassing UI-TARS-2, Gemini-2.5-Pro, and Seed1.8)
- MobileWorld: 41.7% success rate (20.8 point improvement over strongest baselines)
Device-Cloud Collaboration Results
- Performance: 33% improvement in on-device task performance
- Cost reduction: Over 40% reduction in cloud model calls
- Privacy preservation: 40.5% of tasks completed entirely on-device
Technical Architecture
Model Foundation
- Backbone: Qwen3-VL multimodal architecture
- Input modalities: Natural language instructions and rendered UI screenshots
- Output: Structured actions for live Android devices
- Action space: Click, swipe, text input, system buttons, plus enhanced interaction capabilities
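To make "structured actions for live Android devices" concrete, the sketch below maps a few basic actions onto standard adb input commands. The action dictionary format is an assumption made for this example; only the adb commands themselves are standard Android tooling, and running the example requires adb and a connected device.

```python
# Sketch: executing basic structured actions on an Android device via adb.
# The action dict format is an illustrative assumption, not MAI-UI's official schema.
import subprocess

def adb(*args: str) -> None:
    subprocess.run(["adb", "shell", *args], check=True)

def execute(action: dict) -> None:
    kind = action["type"]
    if kind == "click":
        adb("input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "swipe":
        adb("input", "swipe",
            str(action["x1"]), str(action["y1"]),
            str(action["x2"]), str(action["y2"]), "300")           # 300 ms swipe
    elif kind == "type":
        adb("input", "text", action["text"].replace(" ", "%s"))    # adb encodes spaces as %s
    elif kind == "system_button":
        keycodes = {"back": "KEYCODE_BACK", "home": "KEYCODE_HOME"}
        adb("input", "keyevent", keycodes[action["button"]])

if __name__ == "__main__":
    # Example: tap the middle of a 1080x1920 screen (requires a connected device).
    execute({"type": "click", "x": 540, "y": 960})
```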
Training Methodology
- Supervised Fine-tuning: Initial training on curated GUI grounding and navigation data
- Online Reinforcement Learning: Continuous improvement through interaction with live environments
- Self-evolving pipeline: Autonomous data generation and quality improvement
- Multi-dimensional integration: User interactions, MCP tool calls, and traditional GUI operations
Deployment System
- Hybrid architecture: Seamless integration of on-device and cloud models
- Task-aware routing: Intelligent decision-making based on task complexity and privacy requirements
- Privacy-first design: Sensitive operations remain local while complex tasks leverage cloud capabilities
- Cost optimization: Efficient resource utilization through intelligent workload distribution
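As a rough illustration of task-aware routing, the sketch below decides between the on-device and cloud models using simple privacy and complexity heuristics. The keyword list, complexity proxy, and threshold are assumptions made for this example and are much cruder than the routing described in the report.

```python
# Sketch of a task-aware device/cloud router. Keywords, the complexity proxy,
# and the threshold are illustrative assumptions, not MAI-UI's routing policy.

SENSITIVE_KEYWORDS = {"password", "bank", "payment", "contacts", "photo"}

def is_privacy_sensitive(instruction: str) -> bool:
    return any(word in instruction.lower() for word in SENSITIVE_KEYWORDS)

def estimate_complexity(instruction: str) -> int:
    """Crude proxy: count sub-steps implied by the instruction."""
    return 1 + sum(instruction.lower().count(sep) for sep in (" then ", " and ", ","))

def route(instruction: str) -> str:
    """Return 'device' or 'cloud' for a given task."""
    if is_privacy_sensitive(instruction):
        return "device"    # keep sensitive operations local
    if estimate_complexity(instruction) >= 3:
        return "cloud"     # long multi-app workflows go to the larger cloud model
    return "device"        # default on-device to reduce cloud calls

print(route("Check my bank balance"))                      # -> device
print(route("Find a cafe, book a table, then share it"))   # -> cloud
```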
Real-World Applications
Home & Personal Use
- Smart shopping: Proactive suggestions based on calendar integration
- Task automation: Complex multi-app workflows for daily activities
- Contextual assistance: Understanding user intent through natural conversation
Professional & Office Use
- Document management: Intelligent file handling and sharing
- Communication assistance: Email composition with context awareness
- Cross-app integration: Seamless workflows across multiple applications
Navigation & Location Services
- Route planning: Integration with mapping services through MCP tools
- Location-aware suggestions: Context-sensitive recommendations
- Multi-modal transportation: Support for various transportation methods
Technical Specifications
Requirements
- vLLM: Version ≥0.11.0
- Transformers: Version ≥4.57.0
- Python: Compatible with standard ML ecosystem
- Hardware: Scalable from mobile devices to cloud infrastructure
Available Models
- MAI-UI-2B: Lightweight model for resource-constrained environments
- MAI-UI-8B: Balanced performance and efficiency
- Larger variants: 32B and 235B-A22B for maximum capability
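For direct use, the released 2B and 8B checkpoints should be loadable with the standard transformers vision-language classes. The sketch below assumes a repo id of `Tongyi-MAI/MAI-UI-2B` and the generic `AutoModelForImageTextToText`/`AutoProcessor` interface; consult the model card for the exact prompt format and recommended usage.

```python
# Sketch of loading a MAI-UI checkpoint with transformers (>=4.57.0).
# The repo id, auto classes, and message format are assumptions; follow the
# model card for the exact usage.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Tongyi-MAI/MAI-UI-2B"   # assumed repo id under the Tongyi-MAI org

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},   # current UI screenshot
        {"type": "text", "text": "Open the Wi-Fi settings."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```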
Integration Options
- API service: OpenAI-compatible interface through vLLM
- Direct integration: Python SDK for custom applications
- Container deployment: Docker support for scalable deployment
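For the API-service route, a typical pattern is to serve a checkpoint with vLLM (≥0.11.0) and query it through any OpenAI-compatible client. The serving command, repo id, and prompt below are illustrative assumptions; see the repository documentation for the recommended serving flags and prompt format.

```python
# Sketch: querying a MAI-UI model served via vLLM's OpenAI-compatible API.
# Assumes something like `vllm serve Tongyi-MAI/MAI-UI-8B --port 8000` is running;
# the repo id and prompt are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    screenshot_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Tongyi-MAI/MAI-UI-8B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            {"type": "text", "text": "Turn on airplane mode."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)   # expected to contain the next action
```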
Research Impact
Benchmark Leadership
MAI-UI establishes new state-of-the-art performance across multiple authoritative benchmarks, demonstrating both theoretical advancement and practical applicability.
Methodological Contributions
- Device-cloud collaboration: Novel deployment architecture for GUI agents
- Self-evolving data: Autonomous improvement of training datasets
- Extended interaction model: Native support for user dialogue and tool integration
Industry Applications
The project addresses real-world deployment challenges that have historically limited GUI agent adoption, making it suitable for production environments.
Open Source Commitment
Licensing
- Apache License 2.0: Permissive licensing for commercial and research use
- Third-party components: Clearly documented with appropriate attributions
- Community contribution: Open development model encouraging collaboration
Available Resources
- Models: MAI-UI-2B and MAI-UI-8B on Hugging Face
- Code: Complete implementation on GitHub
- Documentation: Comprehensive technical reports and usage guides
- Benchmarks: MobileWorld benchmark for evaluation
Future Directions
Research Extensions
- Larger model variants: Continued development of the 32B and 235B-A22B models
- Cross-platform support: Extension beyond Android to iOS and desktop platforms
- Enhanced tool integration: Broader MCP tool ecosystem
Commercial Applications
- Enterprise deployment: Integration with business workflows
- Accessibility solutions: Assistance for users with disabilities
- Productivity enhancement: Advanced automation for knowledge workers
Citation Information
@misc{zhou2025maiuitechnicalreportrealworld,
title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
year={2025},
eprint={2512.22047},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.22047}
}
Contact Information
- Project Lead: Hanzhang Zhou (hanzhang.zhou@alibaba-inc.com)
- Technical Lead: Xu Zhang (hanguang.zx@alibaba-inc.com)
- Research Director: Yue Wang (yue.w@alibaba-inc.com)
- Institution: Tongyi Lab, Alibaba Group
Additional Resources
- Project Website: https://tongyi-mai.github.io/MAI-UI/
- GitHub Repository: https://github.com/Tongyi-MAI/MAI-UI
- Hugging Face Models: https://huggingface.co/Tongyi-MAI
- Technical Paper: https://arxiv.org/abs/2512.22047
- MobileWorld Benchmark: https://github.com/Tongyi-MAI/MobileWorld