GUI Agents: A Survey
December 18, 2024
Authors: Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. Rossi, Franck Dernoncourt
cs.AI
Abstract
Graphical User Interface (GUI) agents, powered by Large Foundation Models,
have emerged as a transformative approach to automating human-computer
interaction. These agents autonomously interact with digital systems or
software applications via GUIs, emulating human actions such as clicking,
typing, and navigating visual elements across diverse platforms. Motivated by
the growing interest in GUI agents and their fundamental importance, we provide a
comprehensive survey that categorizes their benchmarks, evaluation metrics,
architectures, and training methods. We propose a unified framework that
delineates their perception, reasoning, planning, and acting capabilities.
Furthermore, we identify important open challenges and discuss key future
directions. Finally, this work serves as a basis for practitioners and
researchers to gain an intuitive understanding of current progress, techniques,
benchmarks, and critical open problems that remain to be addressed.