CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
September 17, 2024
Authors: Zachary S. Siegel, Sayash Kapoor, Nitya Nadgir, Benedikt Stroebl, Arvind Narayanan
cs.AI
Abstract
AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.
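
To make the evaluation-system claim concrete, the sketch below shows one way a benchmark of 270 per-paper reproduction tasks could be graded in parallel rather than sequentially. This is a minimal, hypothetical illustration only: the task fields, function names, and use of a thread pool are assumptions for exposition and are not the actual CORE-Bench harness or API.

```python
# Hypothetical sketch of parallel task evaluation (not the CORE-Bench code).
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class ReproductionTask:
    paper_id: str    # one of the 90 papers in the benchmark
    discipline: str  # "computer science", "social science", or "medicine"
    difficulty: str  # one of the three difficulty levels
    question: str    # the reported result the agent must reproduce


def run_agent_on_task(task: ReproductionTask) -> bool:
    # Placeholder for the real work: set up an isolated environment with the
    # paper's code and data, let an agent (e.g., AutoGPT or CORE-Agent)
    # attempt the reproduction, and grade its answer against ground truth.
    # Here it simply returns False so the sketch stays runnable.
    return False


def evaluate(tasks: list[ReproductionTask], max_workers: int = 8) -> float:
    # Running tasks concurrently is what turns days of sequential evaluation
    # into a much shorter wall-clock run.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_agent_on_task, tasks))
    return sum(results) / len(results)
```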