
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

April 9, 2025
Authors: Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge
cs.AI

Abstract

Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs for reasoning benchmarks, establishing more rigorous foundations for future work.
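To make the seed-sensitivity point concrete, the sketch below shows one way to report multi-seed results with mean and spread rather than a single best run. It is a minimal illustration, not the authors' released evaluation code: run_benchmark, its simulated outcomes, and the seed list are hypothetical placeholders standing in for real model calls with pinned decoding parameters and a fixed prompt template.

```python
import random
import statistics

# Hypothetical stand-in for one benchmark pass under a fixed seed and fixed
# decoding parameters. In a real setup this would call an actual LM with a
# pinned temperature/top_p and a fixed prompt template, returning per-question
# correctness (1 = correct, 0 = incorrect).
def run_benchmark(seed: int, num_questions: int = 30) -> list[int]:
    rng = random.Random(seed)
    # Simulated outcomes only; replace with real model evaluations.
    return [1 if rng.random() < 0.5 else 0 for _ in range(num_questions)]

def evaluate_across_seeds(seeds: list[int]) -> None:
    # On a 30-question benchmark such as AIME24, a single flipped answer moves
    # accuracy by roughly 3.3 points, so seed-level variance matters.
    accuracies = []
    for seed in seeds:
        outcomes = run_benchmark(seed)
        accuracies.append(100.0 * sum(outcomes) / len(outcomes))

    mean_acc = statistics.mean(accuracies)
    std_acc = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0

    # Report the mean together with its spread, not a single cherry-picked run.
    print(f"seeds={len(seeds)}  mean={mean_acc:.1f}  std={std_acc:.1f}")
    print("per-seed accuracies:", [f"{a:.1f}" for a in accuracies])

if __name__ == "__main__":
    evaluate_across_seeds(seeds=list(range(10)))
```

Reporting the per-seed spread alongside the mean is the kind of practice the proposed evaluation framework asks for, since small benchmarks can otherwise make noise look like progress.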
