

MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?

April 13, 2025
作者: Yunxiang Zhang, Muhammad Khalifa, Shitanshu Bhushan, Grant D Murphy, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, Lu Wang
cs.AI

Abstract

Existing evaluation of large language model (LLM) agents on scientific discovery lacks objective baselines and metrics to assess the viability of their proposed methods. To address this issue, we introduce MLRC-Bench, a benchmark designed to quantify how effectively language agents can tackle challenging Machine Learning (ML) Research Competitions. Our benchmark highlights open research problems that demand novel methodologies, in contrast to recent benchmarks such as OpenAI's MLE-Bench (Chan et al., 2024) and METR's RE-Bench (Wijk et al., 2024), which focus on well-established research tasks that are largely solvable through sufficient engineering effort. Unlike prior work, e.g., AI Scientist (Lu et al., 2024b), which evaluates the end-to-end agentic pipeline using LLM-as-a-judge, MLRC-Bench measures the key steps of proposing and implementing novel research methods and evaluates them with a newly proposed rigorous protocol and objective metrics. Our curated suite of 7 competition tasks reveals significant challenges for LLM agents. Even the best-performing tested agent (gemini-exp-1206 under MLAB (Huang et al., 2024a)) closes only 9.3% of the gap between baseline and top human participant scores. Furthermore, our analysis reveals a misalignment between LLM-judged innovation and agents' actual performance on cutting-edge ML research problems. MLRC-Bench is a dynamic benchmark designed to grow continually with new ML competitions, encouraging rigorous and objective evaluation of AI's research capabilities.
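To make the headline result concrete, the "closes only 9.3% of the gap" figure can be read as a normalized improvement score. Below is a minimal Python sketch of one plausible formulation, assuming the metric is the agent's improvement over the competition baseline divided by the top human participant's improvement over that same baseline; the function name and the example numbers are illustrative placeholders, not values or definitions taken from the paper.

```python
def gap_closed(agent_score: float, baseline_score: float, top_human_score: float) -> float:
    """Fraction of the baseline-to-top-human gap closed by an agent (assumed formulation).

    Returns 0.0 if the agent only matches the baseline and 1.0 if it matches
    the top human participant; higher-is-better scores are assumed.
    """
    gap = top_human_score - baseline_score
    if gap <= 0:
        raise ValueError("Top human score must exceed the baseline for the gap to be defined.")
    return (agent_score - baseline_score) / gap


# Illustrative numbers only: an agent closing roughly 9.3% of the gap.
print(f"{gap_closed(agent_score=52.79, baseline_score=50.0, top_human_score=80.0):.1%}")  # -> 9.3%
```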

