Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Abstract

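Graphical user interfaces (GUIs) are central to human-computer interaction, yet automating GUI tasks remains difficult because visual environments are complex and varied. Existing approaches rely mainly on textual representations of GUIs, which limits their generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified, purely vision-based framework for autonomous GUI agents that operate across platforms. Our method takes image-based observations as input, grounds natural-language instructions to visual elements, and uses a consistent action space to support cross-platform generalization. To address the limitations of prior work, we integrate explicit planning and reasoning into the model, improving its ability to navigate and interact with complex digital environments autonomously. We construct a large-scale dataset of GUI agent trajectories that incorporates multimodal reasoning and grounding, and we train with a two-stage pipeline that first focuses on general GUI grounding and then on planning and reasoning. Through comprehensive experiments, we show that Aguvis outperforms previous state-of-the-art methods in both offline and real-world online scenarios and is, to our knowledge, the first fully autonomous pure vision GUI agent that can complete tasks independently, without collaborating with external closed-source models. We open-source all datasets, models, and training recipes to support future research; they are available at https://aguvis-project.github.io/.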

English

Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure vision-based framework for autonomous GUI agents that operates across various platforms. Our approach leverages image-based observations, grounds natural-language instructions to visual elements, and employs a consistent action space to ensure cross-platform generalization. To address the limitations of previous work, we integrate explicit planning and reasoning within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. We construct a large-scale dataset of GUI agent trajectories, incorporating multimodal reasoning and grounding, and employ a two-stage training pipeline that first focuses on general GUI grounding, followed by planning and reasoning. Through comprehensive experiments, we demonstrate that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, achieving, to our knowledge, the first fully autonomous pure vision GUI agent capable of performing tasks independently without collaboration with external closed-source models. We open-source all datasets, models, and training recipes to facilitate future research at https://aguvis-project.github.io/.
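
As a rough illustration of why a consistent action space helps cross-platform generalization, the sketch below expresses an action in screen-relative coordinates and resolves it to pixels only when the screen size is known. The `Action` class, its fields, and `to_pixels` are hypothetical names invented for this example; the abstract does not specify the Aguvis action schema, so this should be read as one possible design under that assumption rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical platform-agnostic action (not the Aguvis schema)."""
    kind: str                    # e.g. "click", "type", "scroll"
    x: Optional[float] = None    # horizontal position, normalized to [0, 1]
    y: Optional[float] = None    # vertical position, normalized to [0, 1]
    text: Optional[str] = None   # payload for "type" actions


def to_pixels(action: Action, width: int, height: int) -> Tuple[int, int]:
    """Resolve a normalized action position to pixel coordinates."""
    if action.x is None or action.y is None:
        raise ValueError("action has no position to resolve")
    if not (0.0 <= action.x <= 1.0 and 0.0 <= action.y <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that x == 1.0 or y == 1.0 still maps to a valid pixel index.
    px = min(int(action.x * width), width - 1)
    py = min(int(action.y * height), height - 1)
    return px, py


# The same action description is reused on two differently sized screens.
click = Action(kind="click", x=0.5, y=0.25)
desktop_xy = to_pixels(click, 1920, 1080)
mobile_xy = to_pixels(click, 1080, 2400)
```

Because the action carries no platform-specific identifiers such as DOM node IDs or accessibility-tree paths, the same description can in principle be replayed on a desktop, mobile, or web screenshot; only the final pixel resolution step depends on the device.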
