TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Abstract

We interact with computers every day, in both daily life and work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has been rapid development of AI agents that interact with and effect change in their surrounding environments. But how well do AI agents perform at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into its workflows and for economic policy that seeks to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents on real-world professional tasks, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with coworkers. We build a self-contained environment with internal websites and data that mimics a small software company, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 24% of the tasks autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.
