

Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

April 14, 2025
作者: Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
cs.AI

Abstract

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.
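To make the data-mixture idea concrete, the following is a minimal, illustrative Python sketch of how a mid-training mixture might be assembled from several data-rich task pools before fine-tuning on scarce GUI trajectories. This is not the authors' released pipeline: the source names, pool contents, and mixing weights are assumptions chosen only to mirror the task families and findings described in the abstract.

# Illustrative sketch (not the authors' code): assembling a mid-training mixture
# from several data-rich sources before GUI-trajectory fine-tuning.
# Source names and weights are assumptions, not the paper's released configuration.
import random

def build_mixture(sources, weights, size, seed=0):
    """Sample `size` examples across `sources` according to per-source `weights`."""
    rng = random.Random(seed)
    names = list(sources)
    mixture = []
    for _ in range(size):
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        mixture.append(rng.choice(sources[name]))
    return mixture

# Hypothetical pools standing in for the task families the abstract lists.
sources = {
    "multimodal_math": [{"task": "multimodal_math", "id": i} for i in range(1000)],
    "text_math":       [{"task": "text_math", "id": i} for i in range(1000)],
    "gui_perception":  [{"task": "gui_perception", "id": i} for i in range(1000)],
}
# Weights loosely reflecting the abstract's finding that reasoning data transfers
# well while GUI perception data helps less; the exact ratios here are made up.
weights = {"multimodal_math": 0.45, "text_math": 0.45, "gui_perception": 0.10}

mid_training_set = build_mixture(sources, weights, size=5000)
# mid_training_set would then feed a standard VLM mid-training run, followed by
# fine-tuning on the (scarce) GUI trajectory data for planning benchmarks
# such as WebArena and AndroidWorld.

The weighted-sampling step is the only structural assumption here; the released code, data, and mixture ratios are the ones published at https://github.com/hkust-nlp/GUIMid.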
