ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
October 7, 2024
Authors: Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, Huan Sun
cs.AI
Abstract
The advancements of large language models (LLMs) have piqued growing
interest in developing LLM-based language agents to automate scientific
discovery end-to-end, which has sparked both excitement and skepticism about
the true capabilities of such agents. In this work, we argue that for an agent
to fully automate scientific discovery, it must be able to complete all
essential tasks in the workflow. Thus, we call for rigorous assessment of
agents on individual tasks in a scientific workflow before making bold claims
on end-to-end automation. To this end, we present ScienceAgentBench, a new
benchmark for evaluating language agents for data-driven scientific discovery.
To ensure the scientific authenticity and real-world relevance of our
benchmark, we extract 102 tasks from 44 peer-reviewed publications in four
disciplines and engage nine subject matter experts to validate them. We unify
the target output for every task to a self-contained Python program file and
employ an array of evaluation metrics to examine the generated programs,
execution results, and costs. Each task goes through multiple rounds of manual
validation by annotators and subject matter experts to ensure its annotation
quality and scientific plausibility. We also propose two effective strategies
to mitigate data contamination concerns. Using our benchmark, we evaluate five
open-weight and proprietary LLMs, each with three frameworks: direct prompting,
OpenHands, and self-debug. Given three attempts for each task, the
best-performing agent can only solve 32.4% of the tasks independently and 34.3%
with expert-provided knowledge. These results underscore the limited capacities
of current language agents in generating code for data-driven discovery, let
alone end-to-end automation for scientific research.
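To illustrate the evaluation setup described in the abstract, below is a minimal sketch of how a generated self-contained Python program might be executed and coarsely checked. This is a hypothetical illustration only: the function name, task paths, and checks are assumptions for exposition, not the actual ScienceAgentBench harness or its metrics.

```python
# Hypothetical sketch: the real ScienceAgentBench harness, task schema, and
# evaluation metrics are defined by the paper and its released code, not here.
import subprocess
from pathlib import Path


def run_generated_program(program_path: str, expected_output: str, timeout: int = 600) -> dict:
    """Execute a candidate self-contained Python program and apply two coarse checks:
    (1) did it run without error, and (2) did it produce the expected output file?"""
    result = {"executed": False, "output_exists": False, "stderr": ""}
    try:
        proc = subprocess.run(
            ["python", program_path],
            capture_output=True,
            text=True,
            timeout=timeout,  # guard against non-terminating programs
        )
        result["executed"] = proc.returncode == 0
        result["stderr"] = proc.stderr[-2000:]  # keep a tail of the log for inspection
    except subprocess.TimeoutExpired:
        result["stderr"] = "timed out"
    result["output_exists"] = Path(expected_output).exists()
    return result


if __name__ == "__main__":
    # Example usage with made-up paths; a real harness would iterate over all 102 tasks
    # and layer task-specific success criteria and cost accounting on top of these checks.
    print(run_generated_program("pred_programs/task_001.py", "pred_results/task_001.csv"))
```

In the benchmark itself, such basic execution checks would only be a starting point; the abstract notes that an array of evaluation metrics is used to examine the generated programs, their execution results, and costs.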