OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

October 30, 2024
Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
cs.AI

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
