Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
April 14, 2025
Authors: Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
cs.AI
Abstract
Graphical User Interface (GUI) agents offer cross-platform solutions for
automating complex digital tasks, with significant potential to transform
productivity workflows. However, their performance is often constrained by the
scarcity of high-quality trajectory data. To address this limitation, we
propose training Vision Language Models (VLMs) on data-rich,
reasoning-intensive tasks during a dedicated mid-training stage, and then
examine how incorporating these tasks facilitates generalization to GUI
planning scenarios. Specifically, we explore a range of tasks with readily
available instruction-tuning data, including GUI perception, multimodal
reasoning, and textual reasoning. Through extensive experiments across 11
mid-training tasks, we demonstrate that: (1) Task generalization proves highly
effective, yielding substantial improvements across most settings. For
instance, multimodal mathematical reasoning enhances performance on
AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data
significantly boosts GUI web agent performance, achieving a 5.6% improvement on
WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal
generalization from text-based to visual domains; (2) Contrary to prior
assumptions, GUI perception data - previously considered closely aligned with
GUI agent tasks and widely utilized for training - has a comparatively limited
impact on final performance; (3) Building on these insights, we identify the
most effective mid-training tasks and curate optimized mixture datasets,
resulting in absolute performance gains of 8.0% on WebArena and 12.2% on
AndroidWorld. Our work provides valuable insights into cross-domain knowledge
transfer for GUI agents and offers a practical approach to addressing data
scarcity challenges in this emerging field. The code, data and models will be
available at https://github.com/hkust-nlp/GUIMid.
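The abstract outlines a two-stage recipe: mid-training a VLM on a weighted mixture of reasoning-intensive instruction-tuning tasks, then fine-tuning on the scarcer GUI trajectory data. The sketch below illustrates one way such a mixture could be assembled; the dataset names, file paths, sampling weights, and the `train` calls are illustrative assumptions, not code or settings taken from the paper or the GUIMid repository.

```python
import json
import random

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Hypothetical mid-training sources; names, paths, and weights are assumptions.
MID_TRAINING_SOURCES = {
    "multimodal_math": {"path": "data/mm_math.jsonl", "weight": 0.4},
    "text_math":       {"path": "data/text_math.jsonl", "weight": 0.4},
    "gui_perception":  {"path": "data/gui_perception.jsonl", "weight": 0.2},
}

def sample_mixture(sources, total_examples):
    """Draw a fixed-size mixture of examples according to per-source weights."""
    mixture = []
    for cfg in sources.values():
        n = int(total_examples * cfg["weight"])
        examples = load_jsonl(cfg["path"])
        mixture.extend(random.sample(examples, min(n, len(examples))))
    random.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    # Stage 1: mid-train the base VLM on the reasoning-heavy mixture.
    # Stage 2: fine-tune the result on the scarce GUI trajectory data.
    # `train` stands in for any standard supervised fine-tuning loop (not shown).
    mid_data = sample_mixture(MID_TRAINING_SOURCES, total_examples=300_000)
    # vlm = train(base_vlm, mid_data)
    # gui_agent = train(vlm, load_jsonl("data/gui_trajectories.jsonl"))
```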