AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
April 11, 2025
作者: Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy
cs.AI
Abstract
Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and more cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
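To illustrate the LLM-as-judge setup the abstract describes, the following is a minimal Python sketch that asks a chat model to review a trajectory and answer the same three questions posed to the expert annotators: task success, side effects, and repetitiveness. The prompt wording, the `gpt-4o` model choice, the trajectory format, and the `judge_trajectory` helper are illustrative assumptions, not the judge prompts or data format used in AgentRewardBench.

```python
# Hypothetical sketch of an LLM judge for a web agent trajectory.
# The prompt, model, and trajectory format are illustrative assumptions,
# not the actual AgentRewardBench implementation.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are reviewing a web agent's trajectory.
Task: {goal}

Trajectory (one action per line):
{steps}

Answer in JSON with boolean fields:
- "success": did the agent complete the task?
- "side_effects": did the agent cause unintended changes (e.g., deleted data)?
- "repetitive": did the agent loop or repeat actions unnecessarily?
"""

def judge_trajectory(goal: str, steps: list[str], model: str = "gpt-4o") -> dict:
    """Ask an LLM judge to label a trajectory for success, side effects, and repetitiveness."""
    prompt = JUDGE_PROMPT.format(goal=goal, steps="\n".join(steps))
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON output
    )
    return json.loads(response.choices[0].message.content)

# Example usage with a toy trajectory
verdict = judge_trajectory(
    goal="Find the cheapest flight from Montreal to Toronto on May 3",
    steps=[
        "click('search flights')",
        "type('origin', 'Montreal')",
        "type('destination', 'Toronto')",
        "click('search')",
    ],
)
print(verdict)  # e.g., {"success": false, "side_effects": false, "repetitive": false}
```

Unlike a rule-based checker, a judge of this kind requires no task-specific rules, though, as the paper's results show, its reliability varies across benchmarks and models.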