PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

Abstract

Among MLLM-based GUI agents, the PC scenario differs from the smartphone scenario in two ways: its interactive environment is more complex, and its intra- and inter-app workflows are more intricate. To address these issues, we propose PC-Agent, a hierarchical agent framework. On the perception side, we devise an Active Perception Module (APM) to compensate for the limited ability of current MLLMs to perceive screenshot content. On the decision-making side, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making into three levels: Instruction, Subtask, and Action. Within this architecture, three agents (Manager, Progress, and Decision) handle instruction decomposition, progress tracking, and step-by-step decision-making, respectively. In addition, a Reflection agent provides timely bottom-up error feedback and adjustment. We also introduce PC-Eval, a new benchmark of 25 real-world complex instructions. Empirical results on PC-Eval show that PC-Agent achieves a 32% absolute improvement in task success rate over previous state-of-the-art methods. The code will be publicly available.
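The Instruction-Subtask-Action hierarchy described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`manager`, `Progress`, `decision`, `run`, the `execute` callback) is hypothetical, the MLLM calls are replaced by trivial string rules, and the Active Perception Module is omitted. It only shows one plausible control flow in which a failed action is reported upward and the next decision takes that feedback into account.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Progress:
    """Hypothetical stand-in for the Progress agent: tracks finished subtasks."""
    completed: List[str] = field(default_factory=list)

    def update(self, subtask: str) -> None:
        self.completed.append(subtask)


def manager(instruction: str) -> List[str]:
    """Hypothetical stand-in for the Manager agent.

    A real system would ask an MLLM to decompose the instruction;
    this sketch simply splits on the word ' then '.
    """
    return [part.strip() for part in instruction.split(" then ") if part.strip()]


def decision(subtask: str, progress: Progress, feedback: str) -> str:
    """Hypothetical stand-in for the Decision agent.

    Picks the next action from the subtask, the tracked progress,
    and any error feedback from the previous attempt.
    """
    prefix = "retry" if feedback else "do"
    return f"{prefix}[{len(progress.completed)}]:{subtask}"


def run(instruction: str,
        execute: Callable[[str], bool],
        max_attempts: int = 3) -> List[str]:
    """Run all subtasks in order and return the log of attempted actions."""
    progress = Progress()
    log: List[str] = []
    for subtask in manager(instruction):
        feedback = ""
        for _ in range(max_attempts):
            action = decision(subtask, progress, feedback)
            succeeded = execute(action)
            log.append(action)
            # Reflection stand-in: inspect the outcome and either confirm
            # the subtask or pass error feedback to the next decision.
            if succeeded:
                progress.update(subtask)
                break
            feedback = f"action failed: {action}"
        else:
            raise RuntimeError(f"subtask not completed: {subtask}")
    return log
```

Calling `run("open Notepad then save file", execute)` with an `execute` callback that reports success or failure for each action returns the ordered list of attempted actions, including any retries triggered by the reflection step.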
