GUI Agents: A Survey
December 18, 2024
Authors: Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
cs.AI
Abstract
Graphical User Interface (GUI) agents, powered by Large Foundation Models,
have emerged as a transformative approach to automating human-computer
interaction. These agents autonomously interact with digital systems or
software applications via GUIs, emulating human actions such as clicking,
typing, and navigating visual elements across diverse platforms. Motivated by
the growing interest in and fundamental importance of GUI agents, we provide a
comprehensive survey that categorizes their benchmarks, evaluation metrics,
architectures, and training methods. We propose a unified framework that
delineates their perception, reasoning, planning, and acting capabilities.
Furthermore, we identify important open challenges and discuss key future
directions. Finally, this work serves as a basis for practitioners and
researchers to gain an intuitive understanding of current progress, techniques,
benchmarks, and critical open problems that remain to be addressed.