

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

November 26, 2024
Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
cs.AI

Abstract

Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying their redundant relationships, and serving as the criterion for token selection within self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further removes 33% of redundant visual tokens during training and speeds up training by 1.4x. Navigation experiments across web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
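
The abstract gives no implementation details for the UI-guided token selection, so the following is a minimal, hypothetical sketch of the idea it describes: group near-identical neighboring screenshot patches into connected components (the "UI connected graph") and keep only a few representative tokens per redundant component. The patch size, color tolerance, and all function names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def build_ui_patch_components(image: np.ndarray, patch: int = 28, tol: float = 4.0):
    """Group screenshot patches into connected components of near-identical
    color, approximating the paper's UI connected graph over visual patches.
    `patch` and `tol` are illustrative hyperparameters."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    # Mean RGB per patch as a cheap redundancy signal.
    means = image[: gh * patch, : gw * patch].reshape(
        gh, patch, gw, patch, 3
    ).mean(axis=(1, 3))

    # Union-find over the patch grid: merge 4-neighbors whose mean colors
    # differ by less than `tol` (likely the same flat UI region).
    parent = list(range(gh * gw))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(gh):
        for j in range(gw):
            if j + 1 < gw and np.abs(means[i, j] - means[i, j + 1]).max() < tol:
                union(i * gw + j, i * gw + j + 1)
            if i + 1 < gh and np.abs(means[i, j] - means[i + 1, j]).max() < tol:
                union(i * gw + j, (i + 1) * gw + j)

    return np.array([find(k) for k in range(gh * gw)]).reshape(gh, gw)

def select_tokens(components: np.ndarray, keep_per_component: int = 1, rng=None):
    """Keep a random subset of token indices per component; small components
    (informative patches) are kept in full."""
    if rng is None:
        rng = np.random.default_rng(0)
    flat = components.ravel()
    keep = []
    for c in np.unique(flat):
        idx = np.flatnonzero(flat == c)
        if len(idx) <= keep_per_component:
            keep.extend(idx.tolist())
        else:
            keep.extend(rng.choice(idx, size=keep_per_component, replace=False).tolist())
    return sorted(keep)
```

On a typical screenshot this merges large flat regions (backgrounds, blank panels) into single components, so dropping their duplicate tokens before self-attention is where a reduction like the reported 33% would plausibly come from; the paper's actual selection criterion inside the attention blocks may differ from this sketch.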
