AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

April 11, 2025
Authors: Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J. Pal, Siva Reddy
cs.AI

Abstract

Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and more cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
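
The abstract describes the judgment protocol only at a high level: each trajectory is annotated for success, side effects, and repetitiveness, and LLM judges are then asked to produce the same kind of verdict. The sketch below is a minimal, illustrative example of how such a judge query could be structured in Python; the TrajectoryJudgment fields and the build_judge_prompt helper are assumptions made for illustration, not the benchmark's released API (see https://agent-reward-bench.github.io for the actual code and data).

```python
from dataclasses import dataclass


@dataclass
class TrajectoryJudgment:
    """One verdict on a web agent trajectory (expert label or LLM-judge output)."""
    success: bool       # did the agent complete the task?
    side_effects: bool  # did it cause unintended changes along the way?
    repetitive: bool    # did it loop over the same actions?


def build_judge_prompt(task: str, steps: list[str]) -> str:
    """Format a trajectory so an LLM judge can answer the three annotation questions."""
    transcript = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(steps))
    return (
        f"Task: {task}\n"
        f"Trajectory:\n{transcript}\n\n"
        "Answer yes or no to each question:\n"
        "1. Did the agent successfully complete the task?\n"
        "2. Did the agent cause unintended side effects?\n"
        "3. Did the agent repeat the same actions unnecessarily?"
    )


# Made-up example trajectory, for illustration only.
prompt = build_judge_prompt(
    task="Find the cheapest flight from Montreal to Toronto",
    steps=[
        "Open the flight search page",
        "Enter origin and destination",
        "Sort results by price and open the top result",
    ],
)
print(prompt)
```

In the benchmark setting, judge verdicts of this form are compared against the expert annotations, which is how the reliability of the 12 LLM judges is measured.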
