LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

Abstract

With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone graphical user interface (GUI) agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first set out three key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address them through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail the task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper aims to serve as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents.
