TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
December 18, 2024
Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig
cs.AI
Abstract
We interact with computers every day, whether in daily life or at
work, and many aspects of work can be done entirely with access to a computer
and the Internet. At the same time, thanks to improvements in large language
models (LLMs), there has also been rapid development in AI agents that
interact with and effect change in their surrounding environments. But how
performant are AI agents at helping to accelerate or even autonomously perform
work-related tasks? The answer to this question has important implications
both for industries looking to adopt AI into their workflows and for economic
policymakers seeking to understand the effects that AI adoption may have on the
labor market. To measure how well these LLM agents perform real-world
professional tasks, we introduce TheAgentCompany, an extensible
benchmark for evaluating AI agents that interact with the world in similar ways
to those of a digital worker: by browsing the Web, writing code, running
programs, and communicating with other coworkers. We build a self-contained
environment with internal web sites and data that mimics a small software
company environment, and create a variety of tasks that may be performed by
workers in such a company. We test baseline agents powered by both closed
API-based and open-weights language models (LMs), and find that with the most
competitive agent, 24% of the tasks can be completed autonomously. This paints
a nuanced picture of task automation with LM agents: in a setting simulating
a real workplace, a good portion of simpler tasks can be solved autonomously,
but more difficult long-horizon tasks are still beyond the reach of current
systems.