AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

December 19, 2024
Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
cs.AI

Abstract

In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to building our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
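The rm@8 score reported above measures accuracy when the reward model selects the best of eight sampled solutions per problem. The following is a minimal sketch of that best-of-n selection loop, assuming hypothetical generate_solution and score_solution callables standing in for AceMath-72B-Instruct sampling and AceMath-72B-RM scoring; it illustrates the idea and is not the paper's implementation.

```python
import random
from typing import Callable, List

def best_of_n(problem: str,
              generate_solution: Callable[[str], str],
              score_solution: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate solutions and return the one the reward model scores highest."""
    candidates: List[str] = [generate_solution(problem) for _ in range(n)]
    scores: List[float] = [score_solution(problem, c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

# Toy usage with stub callables; in practice these would wrap the
# instruct model's sampler and the reward model's scoring head.
if __name__ == "__main__":
    gen = lambda p: f"solution draft {random.randint(0, 999)} for: {p}"
    score = lambda p, s: random.random()  # stand-in reward score
    print(best_of_n("What is 12 * 13?", gen, score))
```

Under this scheme, rm@8 is simply the fraction of benchmark problems for which the selected candidate is correct, so it jointly reflects the quality of the generator and the reliability of the reward model.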
