GUI Agents: A Survey

Abstract

Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems or software applications via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
