MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

April 13, 2025
作者: Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
cs.AI

Abstract

Existing evaluation of large language model (LLM) agents on scientific discovery lacks objective baselines and metrics to assess the viability of their proposed methods. To address this issue, we introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Our benchmark highlights open research problems that demand novel methodologies, in contrast to recent benchmarks such as OpenAI's MLE-Bench (Chan et al., 2024) and METR's RE-Bench (Wijk et al., 2024), which focus on well-established research tasks that are largely solvable through sufficient engineering effort. Unlike prior work, e.g., AI Scientist (Lu et al., 2024b), which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a newly proposed rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB (Huang et al., 2024a)) closes only 9.3% of the gap between the baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between the innovation attributed by LLM judges and the agents' actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark designed to grow continually with new ML competitions, encouraging rigorous and objective evaluation of AI's research capabilities.
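To make the "closes only 9.3% of the gap" figure concrete, the sketch below shows one way such a normalized score could be computed from an agent's score, the provided baseline, and the top human participant's score. The function name `gap_closed` and the example numbers are illustrative assumptions for exposition, not the paper's exact metric definition.

```python
def gap_closed(agent_score: float, baseline_score: float, human_score: float) -> float:
    """Fraction of the baseline-to-human gap closed by the agent.

    Hypothetical illustration of the kind of normalized metric described
    in the abstract; the paper's exact formulation may differ.
    """
    return (agent_score - baseline_score) / (human_score - baseline_score)


# Example (made-up numbers): an agent scoring 0.53 on a task whose baseline
# is 0.50 and whose top human score is 0.82 closes
# (0.53 - 0.50) / (0.82 - 0.50) ≈ 9.4% of the gap.
print(f"{gap_closed(0.53, 0.50, 0.82):.1%}")  # -> 9.4%
```

Normalizing by the baseline-to-human gap, rather than reporting raw scores, makes results comparable across competitions whose leaderboards use different metrics and scales.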
