Large Language Model-Brained GUI Agents: A Survey
November 27, 2024
Authors: Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
cs.AI
Abstract
GUIs have long been central to human-computer interaction, providing an
intuitive and visually driven way to access and interact with digital systems.
The advent of LLMs, particularly multimodal models, has ushered in a new era of
GUI automation. These models have demonstrated exceptional capabilities in natural
language understanding, code generation, and visual processing. This has paved
the way for a new generation of LLM-brained GUI agents capable of interpreting
complex GUI elements and autonomously executing actions based on natural
language instructions. These agents represent a paradigm shift, enabling users
to perform intricate, multi-step tasks through simple conversational commands.
Their applications span web navigation, mobile app interactions, and
desktop automation, offering a transformative user experience that
revolutionizes how individuals interact with software. This emerging field is
rapidly advancing, with significant progress in both research and industry.
To provide a structured understanding of this trend, this paper presents a
comprehensive survey of LLM-brained GUI agents, exploring their historical
evolution, core components, and advanced techniques. We address research
questions such as existing GUI agent frameworks, the collection and utilization
of data for training specialized GUI agents, the development of large action
models tailored for GUI tasks, and the evaluation metrics and benchmarks
necessary to assess their effectiveness. Additionally, we examine emerging
applications powered by these agents. Through a detailed analysis, this survey
identifies key research gaps and outlines a roadmap for future advancements in
the field. By consolidating foundational knowledge and state-of-the-art
developments, this work aims to guide both researchers and practitioners in
overcoming challenges and unlocking the full potential of LLM-brained GUI
agents.