AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
December 19, 2024
Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI
Abstract
In this paper, we introduce AceMath, a suite of frontier math models that
excel in solving complex math problems, along with highly effective reward
models capable of evaluating generated solutions and reliably identifying the
correct ones. To develop the instruction-tuned math models, we propose a
supervised fine-tuning (SFT) process that first achieves competitive
performance across general domains, followed by targeted fine-tuning for the
math domain using a carefully curated set of prompts and synthetically
generated responses. The resulting model, AceMath-72B-Instruct, greatly
outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop
a math-specialized reward model, we first construct AceMath-RewardBench, a
comprehensive and robust benchmark for evaluating math reward models across
diverse problems and difficulty levels. After that, we present a systematic
approach to build our math reward models. The resulting model, AceMath-72B-RM,
consistently outperforms state-of-the-art reward models. Furthermore, when
combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest
average rm@8 score across the math reasoning benchmarks. We will release model
weights, training data, and evaluation benchmarks at:
https://research.nvidia.com/labs/adlr/acemath
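
The abstract reports an average rm@8 score without defining it. A common reading, assumed here and not confirmed by the abstract, is best-of-n selection: for each problem, n = 8 responses are sampled from the instruct model, the reward model scores each one, and the problem counts as solved if the highest-scored response is correct. The sketch below illustrates that reading only. The function name `rm_at_n` and the input layout (per-problem lists of scores and correctness flags) are hypothetical and are not taken from the AceMath release.

```python
from typing import Sequence


def rm_at_n(
    scores: Sequence[Sequence[float]],
    correct: Sequence[Sequence[bool]],
) -> float:
    """Fraction of problems whose top-scored candidate is correct.

    scores[p][i]  : reward-model score of candidate i for problem p
    correct[p][i] : whether candidate i for problem p has the right answer
    (Hypothetical input layout, chosen for illustration.)
    """
    if not scores or len(scores) != len(correct):
        raise ValueError("need one non-empty score list per problem")

    hits = 0
    for problem_scores, problem_correct in zip(scores, correct):
        if not problem_scores or len(problem_scores) != len(problem_correct):
            raise ValueError("scores and correctness flags must align")
        # Pick the candidate the reward model ranks highest.
        # On ties, max() keeps the earliest candidate.
        best = max(range(len(problem_scores)), key=lambda i: problem_scores[i])
        hits += bool(problem_correct[best])
    return hits / len(scores)


if __name__ == "__main__":
    # Two toy problems with three candidates each (n = 3 rather than 8).
    toy_scores = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]
    toy_correct = [[False, True, False], [False, True, True]]
    print(rm_at_n(toy_scores, toy_correct))
```

Under this reading, the metric depends on both models: the instruct model must produce at least one correct response among the n samples, and the reward model must rank it first.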