Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Abstract

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge for two reasons: (i) most benchmarks are limited to specific modalities or domains (e.g., text-only, web navigation, Q&A, coding), and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step, sequential nature of the tasks. To address these challenges, we introduce Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS), where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure, so that a full benchmark evaluation completes in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to the 74.5% success rate of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into opportunities for future research in agent development and data generation using Windows Agent Arena.

Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena

