
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

April 11, 2025
Authors: Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy
cs.AI

Abstract

Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluation with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and more cost-effective evaluation. However, it is unclear how effective LLMs are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
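
To make the LLM-judge setup concrete, the sketch below shows one common way to frame trajectory evaluation as a prompt to a judge model: the task goal, per-step observations, and actions are serialized into a prompt, and the judge's YES/NO verdict is parsed into a success label. This is a minimal illustration under stated assumptions; the names used here (Step, Trajectory, build_judge_prompt, call_llm, judge_trajectory) are hypothetical and do not describe the paper's actual judges or the AgentRewardBench API.

# Minimal, hypothetical sketch of an LLM judge for a web agent trajectory.
# All names below are illustrative, not part of AgentRewardBench.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    observation: str   # e.g., a page summary or accessibility-tree snippet at this step
    action: str        # e.g., "click('Add to cart')"

@dataclass
class Trajectory:
    goal: str          # natural-language task given to the agent
    steps: List[Step]

def build_judge_prompt(traj: Trajectory) -> str:
    """Serialize the goal, observations, and actions into a judge prompt."""
    lines = [f"Task: {traj.goal}", "", "Trajectory:"]
    for i, step in enumerate(traj.steps, 1):
        lines.append(f"Step {i} observation: {step.observation}")
        lines.append(f"Step {i} action: {step.action}")
    lines.append("")
    lines.append("Did the agent successfully complete the task?")
    lines.append("Answer YES or NO, then briefly justify your verdict.")
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (plug in your own client here)."""
    raise NotImplementedError

def judge_trajectory(traj: Trajectory) -> bool:
    """Return True if the judge's reply starts with YES."""
    reply = call_llm(build_judge_prompt(traj))
    return reply.strip().upper().startswith("YES")

if __name__ == "__main__":
    # Usage example: build and inspect the prompt for a toy trajectory.
    traj = Trajectory(
        goal="Add a blue T-shirt to the shopping cart.",
        steps=[Step("Product page for 'Blue T-shirt'", "click('Add to cart')")],
    )
    print(build_judge_prompt(traj))

In practice, a judge would also be asked about side effects (e.g., unintended purchases) and repetitive behavior, which are the other expert-annotated dimensions the abstract mentions; those would simply be additional questions in the same prompt.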
