AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Abstract

In this paper, we introduce AceMath, a suite of frontier math models that excel at solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. We then present a systematic approach to building our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
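The abstract reports an rm@8 score without defining it. The sketch below assumes the common reading: for each problem, 8 candidate solutions are sampled, a reward model scores each one, and the top-scoring candidate is checked against the reference answer. The function name `rm_at_k`, the `(answer, reward)` pair layout, and the exact-match correctness check are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple


def rm_at_k(
    samples: List[List[Tuple[str, float]]],
    gold: List[str],
    k: int = 8,
) -> float:
    """Fraction of problems whose highest-reward answer among the first k samples is correct.

    samples[i] holds (final_answer, reward_score) pairs for problem i (hypothetical layout).
    gold[i] is the reference answer; correctness is plain string equality here,
    which is a simplifying assumption.
    """
    if len(samples) != len(gold):
        raise ValueError("samples and gold must have the same length")
    if not gold:
        raise ValueError("at least one problem is required")

    correct = 0
    for candidates, reference in zip(samples, gold):
        pool = candidates[:k]  # only the first k sampled responses are considered
        if not pool:
            raise ValueError("each problem needs at least one candidate")
        # Select the candidate the reward model scores highest.
        best_answer, _ = max(pool, key=lambda pair: pair[1])
        correct += int(best_answer == reference)
    return correct / len(gold)


# Toy example with k=2: the first problem is solved, the second is not.
toy_samples = [
    [("4", 0.9), ("5", 0.2)],
    [("10", 0.3), ("12", 0.8)],
]
toy_gold = ["4", "10"]
print(rm_at_k(toy_samples, toy_gold, k=2))  # 0.5
```

Under this reading, the metric rewards a reward model for ranking a correct solution first, so it depends on both the quality of the sampled solutions and the accuracy of the scoring.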

