

PaperBench: Evaluating AI's Ability to Replicate AI Research

April 2, 2025
Authors: Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
cs.AI

Abstract

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code at https://github.com/openai/preparedness to facilitate future research in understanding the AI engineering capabilities of AI agents.
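The abstract describes rubrics that hierarchically decompose each replication task into weighted, individually gradable sub-tasks whose scores roll up into an overall replication score. Below is a minimal illustrative sketch, in Python, of one way such a rubric tree and its weighted score aggregation could be represented; the class and field names (`RubricNode`, `requirement`, `weight`) are hypothetical and do not reflect PaperBench's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RubricNode:
    """One node in a hierarchical replication rubric (illustrative only).

    Leaves carry a binary grade (did the attempt satisfy this criterion?);
    internal nodes aggregate their children's scores by weight.
    """
    requirement: str
    weight: float = 1.0
    children: list["RubricNode"] = field(default_factory=list)
    grade: Optional[bool] = None  # set by a judge for leaf nodes only

    def score(self) -> float:
        """Return a score in [0, 1] for this subtree."""
        if not self.children:  # leaf: graded directly
            return 1.0 if self.grade else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

# Example: a tiny two-level rubric for one paper
rubric = RubricNode(
    requirement="Replicate paper X",
    children=[
        RubricNode("Training code runs end to end", weight=2.0, grade=True),
        RubricNode("Reported metric reproduced within tolerance", weight=3.0, grade=False),
    ],
)
print(f"Replication score: {rubric.score():.1%}")  # 40.0%
```

In PaperBench itself, leaf grades are assigned by the LLM-based judge described in the abstract; here they are hard-coded purely to show how weighted aggregation over a rubric tree could work.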
