Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
October 7, 2024
Authors: Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su
cs.AI
Abstract
Multimodal large language models (MLLMs) are transforming the capabilities of
graphical user interface (GUI) agents, facilitating their transition from
controlled simulations to complex, real-world applications across various
platforms. However, the effectiveness of these agents hinges on the robustness
of their grounding capability. Current GUI agents predominantly utilize
text-based representations such as HTML or accessibility trees, which, despite
their utility, often introduce noise, incompleteness, and increased
computational overhead. In this paper, we advocate a human-like embodiment for
GUI agents that perceive the environment entirely visually and directly take
pixel-level operations on the GUI. The key is visual grounding models that can
accurately map diverse referring expressions of GUI elements to their
coordinates on the GUI across different platforms. We show that a simple
recipe, which includes web-based synthetic data and slight adaptation of the
LLaVA architecture, is surprisingly effective for training such visual
grounding models. We collect the largest dataset for GUI visual grounding so
far, containing 10M GUI elements and their referring expressions over 1.3M
screenshots, and use it to train UGround, a strong universal visual grounding
model for GUI agents. Empirical results on six benchmarks spanning three
categories (grounding, offline agent, and online agent) show that 1) UGround
substantially outperforms existing visual grounding models for GUI agents, by
up to 20% absolute, and 2) agents with UGround outperform state-of-the-art
agents, despite the fact that existing agents use additional text-based input
while ours only uses visual perception. These results provide strong support
for the feasibility and promises of GUI agents that navigate the digital world
as humans do.
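To make the vision-only setup described above concrete, the following is a minimal sketch of the kind of agent loop the abstract describes: a planner MLLM proposes the next action as a referring expression, a visual grounding model (such as UGround) maps that expression to pixel coordinates on the current screenshot, and the action is executed directly on the GUI. All interfaces here (`plan_next_action`, `ground`, `click_at`, and the `Action` type) are hypothetical placeholders for illustration, not the authors' actual API.

```python
# Minimal sketch of a vision-only, pixel-level GUI agent loop.
# Hypothetical interfaces; not the authors' released code or API.

from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "stop"
    target_desc: str   # referring expression, e.g. "the blue 'Sign in' button"
    text: str = ""     # text to type, if any


def run_episode(task: str, planner, grounder, env, max_steps: int = 20):
    """Drive the GUI using only screenshots and pixel-level operations."""
    for _ in range(max_steps):
        screenshot = env.screenshot()                        # raw pixels only
        action = planner.plan_next_action(task, screenshot)  # MLLM planner -> Action
        if action.kind == "stop":
            break
        # Visual grounding: referring expression -> (x, y) on this screenshot.
        x, y = grounder.ground(screenshot, action.target_desc)
        if action.kind == "click":
            env.click_at(x, y)
        elif action.kind == "type":
            env.click_at(x, y)
            env.type_text(action.text)
```

The design choice this sketch reflects is the paper's central claim: because grounding is handled by a dedicated model mapping language to coordinates, the agent needs no HTML or accessibility tree, only the screenshot it can see.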