OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

October 30, 2024
Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao
cs.AI

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
