OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Abstract

Existing efforts in building GUI agents rely heavily on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs because they lag significantly behind their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

