Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

April 14, 2025
Authors: Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
cs.AI

Abstract

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and a 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data (previously considered closely aligned with GUI agent tasks and widely used for training) has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data, and models will be available at https://github.com/hkust-nlp/GUIMid.
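
The abstract outlines a two-stage recipe: a mid-training stage on data-rich reasoning tasks, followed by adaptation on scarce GUI trajectory data. Below is a minimal Python sketch of how such a pipeline might be organized. The task names, mixture weights, step counts, batch size, and the `model.update` hook are all hypothetical placeholders for illustration; the paper's actual datasets, proportions, and training code live in the GUIMid repository linked above.

```python
import random

# Hypothetical mid-training mixture. Task names and sampling weights are
# illustrative placeholders, NOT the proportions reported in the paper;
# the abstract only names the task families that were explored.
MIDTRAIN_MIXTURE = {
    "multimodal_math": 0.4,        # multimodal mathematical reasoning
    "text_math": 0.3,              # text-only mathematical reasoning
    "multimodal_reasoning": 0.2,   # other multimodal reasoning data
    "gui_perception": 0.1,         # found to help comparatively little
}

def sample_midtrain_batch(datasets, batch_size, rng):
    """Draw one batch, picking each example's task by mixture weight."""
    tasks = list(MIDTRAIN_MIXTURE)
    weights = [MIDTRAIN_MIXTURE[t] for t in tasks]
    picked = rng.choices(tasks, weights=weights, k=batch_size)
    return [rng.choice(datasets[task]) for task in picked]

def two_stage_training(model, reasoning_datasets, gui_trajectories,
                       midtrain_steps=10_000, finetune_steps=1_000,
                       batch_size=32, seed=0):
    """Stage 1: mid-train on data-rich reasoning tasks.
    Stage 2: fine-tune on the (scarce) GUI trajectory data."""
    rng = random.Random(seed)
    for _ in range(midtrain_steps):
        batch = sample_midtrain_batch(reasoning_datasets, batch_size, rng)
        model.update(batch)  # placeholder for one optimizer step
    for _ in range(finetune_steps):
        batch = rng.sample(gui_trajectories,
                           k=min(batch_size, len(gui_trajectories)))
        model.update(batch)
```

One design note on the sketch: sampling tasks proportionally at batch time, rather than concatenating datasets up front, makes it cheap to re-weight the mixture when searching for the most effective combination, which is the kind of curation the abstract describes.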
