Large Language Model-Brained GUI Agents: A Survey

November 27, 2024
Authors: Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
cs.AI

Abstract

GUIs have long been central to human-computer interaction, providing an intuitive and visually driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. These models have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that changes how individuals interact with software. This emerging field is advancing rapidly, with significant progress in both research and industry.

To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.
