LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration Benchmark

April 18, 2025
Authors: Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, Wenchao Meng
cs.AI

Abstract

Mobile GUI agents show promise in automating tasks but face generalization challenges in diverse real-world scenarios. Traditional approaches using pre-training or fine-tuning with massive datasets struggle with the diversity of mobile applications and user-specific tasks. We propose enhancing mobile GUI agent capabilities through human demonstrations, focusing on improving performance in unseen scenarios rather than pursuing universal generalization through larger datasets. To realize this paradigm, we introduce LearnGUI, the first comprehensive dataset specifically designed for studying demonstration-based learning in mobile GUI agents, comprising 2,252 offline tasks and 101 online tasks with high-quality human demonstrations. We further develop LearnAct, a sophisticated multi-agent framework that automatically extracts knowledge from demonstrations to enhance task completion. This framework integrates three specialized agents: DemoParser for knowledge extraction, KnowSeeker for relevant knowledge retrieval, and ActExecutor for demonstration-enhanced task execution. Our experimental results show significant performance gains in both offline and online evaluations. In offline assessments, a single demonstration improves model performance, increasing Gemini-1.5-Pro's accuracy from 19.3% to 51.7%. In online evaluations, our framework enhances UI-TARS-7B-SFT's task success rate from 18.1% to 32.8%. The LearnAct framework and the LearnGUI benchmark establish demonstration-based learning as a promising direction for more adaptable, personalized, and deployable mobile GUI agents.
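To make the three-agent design concrete, below is a minimal, illustrative Python sketch of how DemoParser, KnowSeeker, and ActExecutor could be chained into a retrieve-and-execute pipeline. Only the three agent names come from the paper; the `Demonstration` container, all method names, and the stubbed internals are hypothetical and stand in for the LLM/VLM-based components described in the abstract.

```python
from dataclasses import dataclass

# Hypothetical container for a recorded human demonstration:
# the natural-language task plus its (screenshot, action) trace.
@dataclass
class Demonstration:
    task: str
    steps: list  # e.g. [(screenshot, action), ...]

class DemoParser:
    """Extracts reusable knowledge from a demonstration (stubbed; the paper uses a model for this)."""
    def parse(self, demo: Demonstration) -> str:
        return f"Knowledge distilled from demo: {demo.task}"

class KnowSeeker:
    """Retrieves the demonstration knowledge most relevant to the current task."""
    def __init__(self, knowledge_base: list):
        self.knowledge_base = knowledge_base

    def retrieve(self, task: str, k: int = 1) -> list:
        # A real implementation would rank by embedding similarity; this picks naively.
        return self.knowledge_base[:k]

class ActExecutor:
    """Executes the task, conditioning the GUI agent on the retrieved knowledge."""
    def execute(self, task: str, knowledge: list) -> bool:
        prompt = f"Task: {task}\nRelevant demonstrations:\n" + "\n".join(knowledge)
        # The prompt would be sent to a GUI agent model (e.g. UI-TARS or Gemini-1.5-Pro);
        # here we just print it and return a placeholder success flag.
        print(prompt)
        return True

# Usage: parse demonstrations once, then retrieve-and-execute for each new task.
demos = [Demonstration(task="Add a contact in the phone app", steps=[])]
knowledge_base = [DemoParser().parse(d) for d in demos]
new_task = "Add a new contact named Alice"
retrieved = KnowSeeker(knowledge_base).retrieve(new_task)
success = ActExecutor().execute(new_task, retrieved)
```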
