TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

December 18, 2024
Authors: Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig
cs.AI

Abstract

We interact with computers every day, whether in daily life or at work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has been rapid development in AI agents that interact with and effect change in their surrounding environments. But how well do AI agents perform at helping to accelerate, or even autonomously carry out, work-related tasks? The answer to this question has important implications both for industries looking to adopt AI into their workflows and for economic policy, which must account for the effects that AI adoption may have on the labor market. To measure the progress of LLM agents on real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in ways similar to a digital worker: by browsing the Web, writing code, running programs, and communicating with coworkers. We build a self-contained environment with internal websites and data that mimics a small software company, and create a variety of tasks that workers in such a company might perform. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents: in a setting simulating a real workplace, a good portion of the simpler tasks can be solved autonomously, but more difficult long-horizon tasks remain beyond the reach of current systems.

