AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

December 19, 2024
Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

In this paper, we introduce AceMath, a suite of frontier math models that excel at solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. We then present a systematic approach to building our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, combining AceMath-72B-Instruct with AceMath-72B-RM achieves the highest average rm@8 score across the math reasoning benchmarks. We will release the model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
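The rm@8 metric reported above reranks eight sampled solutions per problem with the reward model and counts the fraction of problems where the top-scored candidate is correct. A minimal sketch of that best-of-n selection (the `rm_at_n` helper, the data layout, and the scoring interface are illustrative assumptions, not the paper's implementation):

```python
from typing import Callable, Sequence

def rm_at_n(
    problems: Sequence[dict],
    score: Callable[[str, str], float],
    n: int = 8,
) -> float:
    """Fraction of problems where the reward model's top-ranked
    candidate (out of n sampled solutions) is correct.

    Each problem dict is assumed to hold a "question" string and a
    "candidates" list of {"text": str, "is_correct": bool} entries.
    `score(question, solution_text)` stands in for the reward model.
    """
    hits = 0
    for prob in problems:
        candidates = prob["candidates"][:n]  # keep at most n samples
        # Rerank: pick the candidate the reward model scores highest.
        best = max(candidates, key=lambda c: score(prob["question"], c["text"]))
        hits += int(best["is_correct"])
    return hits / len(problems)
```

In practice, `score` would be a forward pass of AceMath-72B-RM over the question–solution pair; any scoring function with the same interface can be plugged in for evaluation.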
