MLGym: A New Framework and Benchmark for Advancing AI Research Agents
February 20, 2025
Authors: Deepak Nathani, Lovish Madaan, Nicholas Roberts, Nikolay Bashlykov, Ajay Menon, Vincent Moens, Amar Budhiraja, Despoina Magka, Vladislav Vorotilov, Gaurav Chaurasia, Dieuwke Hupkes, Ricardo Silveira Cabral, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach, William Yang Wang, Roberta Raileanu
cs.AI
Abstract
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents. MLGym-Bench consists of 13 diverse and open-ended AI research tasks spanning domains such as computer vision, natural language processing, reinforcement learning, and game theory. Solving these tasks requires real-world AI research skills such as generating new ideas and hypotheses, creating and processing data, implementing ML methods, training models, running experiments, analyzing the results, and iterating through this process to improve on a given task. We evaluate a number of frontier large language models (LLMs) on our benchmark, including Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro. Our MLGym framework makes it easy to add new tasks, integrate and evaluate models or agents, generate synthetic data at scale, and develop new learning algorithms for training agents on AI research tasks. We find that current frontier models can improve on the given baselines, usually by finding better hyperparameters, but do not generate novel hypotheses, algorithms, architectures, or substantial improvements. We open-source our framework and benchmark to facilitate future research in advancing the AI research capabilities of LLM agents.
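
The "Gym environment" framing above implies the standard agent-environment loop: an agent observes the task state, issues an action, and receives a reward tied to task performance. Below is a minimal sketch of that paradigm applied to an ML research task, written against the generic Gymnasium API; the environment class, observation fields, and action format here are illustrative assumptions, not MLGym's actual interface.

```python
# Illustrative sketch of a Gym-style loop for an AI research task.
# The class name, observation fields, and action format are assumptions
# for exposition; they do not reflect MLGym's real API.
import gymnasium as gym


class ResearchTaskEnv(gym.Env):
    """Hypothetical environment wrapping an open-ended ML research task.

    Observations are textual (task description, command output); actions
    are commands proposed by an LLM agent; the reward would track the
    change in the task's evaluation metric after each step.
    """

    def reset(self, *, seed=None, options=None):
        observation = {"task": "improve the given baseline", "logs": ""}
        return observation, {}

    def step(self, action):
        # A real system would execute the agent's command in a sandboxed
        # workspace and re-run the task's evaluation script.
        observation = {"task": "improve the given baseline",
                       "logs": f"ran: {action}"}
        reward = 0.0          # e.g., metric improvement over the baseline
        terminated = False    # e.g., the agent submits a final solution
        truncated = False     # e.g., the step or compute budget ran out
        return observation, reward, terminated, truncated, {}


env = ResearchTaskEnv()
obs, info = env.reset()
for _ in range(5):
    action = "python train.py --lr 3e-4"  # an LLM agent would propose this
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
```

Framing research tasks this way is what lets the authors treat agent development as an RL problem: any algorithm that works against a Gym-style interface can, in principle, be used to train or evaluate an AI research agent.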