Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
December 5, 2024
Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
cs.AI
Abstract
Graphical User Interfaces (GUIs) are critical to human-computer interaction,
yet automating GUI tasks remains challenging due to the complexity and
variability of visual environments. Existing approaches often rely on textual
representations of GUIs, which introduce limitations in generalization,
efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure
vision-based framework for autonomous GUI agents that operates across various
platforms. Our approach leverages image-based observations, grounds
natural-language instructions in visual elements, and employs a consistent
action space to ensure cross-platform generalization. To address the
limitations of previous work, we integrate explicit planning and reasoning
within the model, enhancing its ability to autonomously navigate and interact
with complex digital environments. We construct a large-scale dataset of GUI
agent trajectories, incorporating multimodal reasoning and grounding, and
employ a two-stage training pipeline that first focuses on general GUI
grounding, followed by planning and reasoning. Through comprehensive
experiments, we demonstrate that Aguvis surpasses previous state-of-the-art
methods in both offline and real-world online scenarios, achieving, to our
knowledge, the first fully autonomous pure vision GUI agent capable of
performing tasks independently without collaboration with external
closed-source models. We open-sourced all datasets, models, and training
recipes to facilitate future research at https://aguvis-project.github.io/.
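To make the idea of a consistent, platform-independent action space concrete, here is a minimal sketch. It is not the Aguvis action space, which the abstract does not specify. The `Click` type, the `to_pixels` helper, and the choice of coordinates normalized to [0, 1] are all assumptions made for illustration. The sketch only shows how one action, predicted from a screenshot, could be mapped onto screens of different sizes.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Click:
    """Hypothetical click action; x and y are normalized to [0, 1]."""
    x: float
    y: float


def to_pixels(action: Click, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized click onto a screen of the given pixel size."""
    if not (0.0 <= action.x <= 1.0 and 0.0 <= action.y <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that a coordinate of exactly 1.0 maps to the last valid pixel.
    px = min(int(action.x * width), width - 1)
    py = min(int(action.y * height), height - 1)
    return px, py


if __name__ == "__main__":
    center = Click(0.5, 0.5)
    # The same action resolves to different pixels on different screens.
    print(to_pixels(center, 1920, 1080))  # desktop-sized screenshot
    print(to_pixels(center, 1080, 2400))  # phone-sized screenshot
```

Under this assumption, the action itself carries no platform-specific detail. Only the final mapping step depends on the screen, which is one plausible way to keep a single action vocabulary across desktop, mobile, and web.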