ShowUI: One Vision-Language-Action Model for GUI Visual Agent
November 26, 2024
Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
cs.AI
Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they fall short of perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among patches that then serve as the criterion for token selection within self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation and pairing multi-turn query-action sequences per screenshot to improve training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces redundant visual tokens by 33% during training and yields a 1.4x speedup. Navigation experiments across web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
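As a concrete illustration of innovation (i), the sketch below groups screenshot patches into a UI connected graph and drops redundant tokens. It is a minimal reading of the abstract, assuming that patches in near-uniform regions (e.g., blank backgrounds) are redundant; the patch size, the colour threshold `tau`, and the keep-one-representative-per-component policy are illustrative assumptions, not the paper's exact hyper-parameters.

```python
# Minimal sketch of UI-guided visual token selection (assumed details:
# 28px patches, mean-colour signature, tau threshold, one kept token
# per redundant component). Not the paper's exact implementation.
import numpy as np

def union_find_components(n, edges):
    """Plain union-find over n nodes; returns a component id per node."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    return [find(i) for i in range(n)]

def select_ui_tokens(image, patch=28, tau=1.0, rng=None):
    """Build a UI connected graph over patches and return the indices
    (raster order) of tokens to keep.

    image: (H, W, 3) uint8 screenshot with H, W divisible by `patch`.
    """
    rng = rng or np.random.default_rng(0)
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    # Mean colour per patch as a cheap redundancy signature.
    patches = image.reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))
    # Connect horizontally/vertically adjacent patches with near-identical colour.
    edges = []
    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.abs(patches[i, j] - patches[i, j + 1]).max() < tau:
                edges.append((idx, idx + 1))
            if i + 1 < gh and np.abs(patches[i, j] - patches[i + 1, j]).max() < tau:
                edges.append((idx, idx + gw))
    comp = union_find_components(gh * gw, edges)
    # Keep every singleton patch; for each multi-patch component, keep one
    # randomly sampled representative and skip the redundant rest.
    members = {}
    for idx, c in enumerate(comp):
        members.setdefault(c, []).append(idx)
    keep = []
    for comp_members in members.values():
        if len(comp_members) == 1:
            keep.extend(comp_members)
        else:
            keep.append(int(rng.choice(comp_members)))
    return sorted(keep)

# Example: a blank 10x10-patch screenshot collapses to a single token.
img = np.zeros((280, 280, 3), dtype=np.uint8)
print(len(select_ui_tokens(img)))  # -> 1
```

In the full model, the skipped indices would be excluded from the self-attention computation of the vision-language backbone, which is where the reported 33% reduction in redundant visual tokens and the 1.4x speedup would come from.

For innovation (iii), a resampling strategy against data-type imbalance can be as simple as inverse-frequency sampling. This hedged sketch assumes each training sample carries a data-type tag (the tags and weighting are illustrative, not the paper's exact recipe):

```python
# Illustrative inverse-frequency resampling over data types; the type
# tags and weighting scheme are assumptions, not the paper's recipe.
from collections import Counter
import random

def balanced_sample(samples, types, k, seed=0):
    """Draw k samples with probability inversely proportional to the
    frequency of each sample's data type, so rare types appear more often."""
    freq = Counter(types)
    weights = [1.0 / freq[t] for t in types]
    return random.Random(seed).choices(samples, weights=weights, k=k)
```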