Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
December 5, 2024
Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
cs.AI
Abstract
Graphical User Interfaces (GUIs) are critical to human-computer interaction,
yet automating GUI tasks remains challenging due to the complexity and
variability of visual environments. Existing approaches often rely on textual
representations of GUIs, which introduce limitations in generalization,
efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure
vision-based framework for autonomous GUI agents that operates across various
platforms. Our approach leverages image-based observations, grounds
natural-language instructions to visual elements, and employs a consistent
action space to ensure cross-platform generalization. To address the
limitations of previous work, we integrate explicit planning and reasoning
within the model, enhancing its ability to autonomously navigate and interact
with complex digital environments. We construct a large-scale dataset of GUI
agent trajectories, incorporating multimodal reasoning and grounding, and
employ a two-stage training pipeline that first focuses on general GUI
grounding, followed by planning and reasoning. Through comprehensive
experiments, we demonstrate that Aguvis surpasses previous state-of-the-art
methods in both offline and real-world online scenarios, achieving, to our
knowledge, the first fully autonomous pure vision GUI agent capable of
performing tasks independently without collaboration with external
closed-source models. We have open-sourced all datasets, models, and training
recipes at https://aguvis-project.github.io/ to facilitate future research.