GUI Agents: A Survey

Abstract

Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents interact autonomously with digital systems and software applications through their GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest in GUI agents and their fundamental importance, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. We also identify important open challenges and discuss key future directions. This work is intended as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and the critical open problems that remain to be addressed.

