Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
April 14, 2025
Authors: Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
cs.AI
Abstract
Graphical User Interface (GUI) agents offer cross-platform solutions for
automating complex digital tasks, with significant potential to transform
productivity workflows. However, their performance is often constrained by the
scarcity of high-quality trajectory data. To address this limitation, we
propose training Vision Language Models (VLMs) on data-rich,
reasoning-intensive tasks during a dedicated mid-training stage, and then
examine how incorporating these tasks facilitates generalization to GUI
planning scenarios. Specifically, we explore a range of tasks with readily
available instruction-tuning data, including GUI perception, multimodal
reasoning, and textual reasoning. Through extensive experiments across 11
mid-training tasks, we demonstrate that: (1) Task generalization proves highly
effective, yielding substantial improvements across most settings. For
instance, multimodal mathematical reasoning enhances performance on
AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data
significantly boosts GUI web agent performance, achieving a 5.6% improvement on
WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal
generalization from text-based to visual domains; (2) Contrary to prior
assumptions, GUI perception data - previously considered closely aligned with
GUI agent tasks and widely utilized for training - has a comparatively limited
impact on final performance; (3) Building on these insights, we identify the
most effective mid-training tasks and curate optimized mixture datasets,
resulting in absolute performance gains of 8.0% on WebArena and 12.2% on
AndroidWorld. Our work provides valuable insights into cross-domain knowledge
transfer for GUI agents and offers a practical approach to addressing data
scarcity challenges in this emerging field. The code, data and models will be
available at https://github.com/hkust-nlp/GUIMid.
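The abstract outlines a two-stage recipe: mid-training a VLM on a weighted mixture of reasoning-intensive instruction-tuning tasks, then fine-tuning on the scarcer GUI trajectory data. The sketch below illustrates one way such a mixture could be assembled; the dataset names, file paths, sampling weights, and the `train` calls are illustrative assumptions, not code or settings taken from the paper or the GUIMid repository.

```python
import json
import random

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical mid-training sources; names, paths, and weights are assumptions.
MID_TRAINING_SOURCES = {
    "multimodal_math": {"path": "data/mm_math.jsonl", "weight": 0.4},
    "text_math":       {"path": "data/text_math.jsonl", "weight": 0.4},
    "gui_perception":  {"path": "data/gui_perception.jsonl", "weight": 0.2},
}

def sample_mixture(sources, total_examples):
    """Draw a fixed-size mixture of examples according to per-source weights."""
    mixture = []
    for cfg in sources.values():
        n = int(total_examples * cfg["weight"])
        examples = load_jsonl(cfg["path"])
        mixture.extend(random.sample(examples, min(n, len(examples))))
    random.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    # Stage 1: mid-train the base VLM on the reasoning-heavy mixture.
    # Stage 2: fine-tune the result on the scarce GUI trajectory data.
    # `train` stands in for any standard supervised fine-tuning loop (not shown).
    mid_data = sample_mixture(MID_TRAINING_SOURCES, total_examples=300_000)
    # vlm = train(base_vlm, mid_data)
    # gui_agent = train(vlm, load_jsonl("data/gui_trajectories.jsonl"))
```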