AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Abstract

In this paper, we introduce AceMath, a suite of frontier math models that excel at solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. We then present a systematic approach to building our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
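The abstract does not define rm@8. A common reading, assumed here, is best-of-8 selection: sample 8 candidate solutions per problem, let the reward model score each, keep the top-scored candidate, and report the fraction of problems where that candidate is correct. The sketch below illustrates that reading only; the function name `rm_at_n`, its arguments, and the toy scores are hypothetical and are not taken from the AceMath release.

```python
from typing import Sequence


def rm_at_n(rm_scores: Sequence[Sequence[float]],
            is_correct: Sequence[Sequence[bool]]) -> float:
    """Return the fraction of problems whose highest-scored candidate is correct.

    rm_scores[i][j]  : hypothetical reward-model score for candidate j of problem i
    is_correct[i][j] : whether that candidate's final answer matches the reference
    """
    if not rm_scores or len(rm_scores) != len(is_correct):
        raise ValueError("need one score list and one label list per problem")

    hits = 0
    for scores, labels in zip(rm_scores, is_correct):
        if not scores or len(scores) != len(labels):
            raise ValueError("each problem needs matching, non-empty candidates")
        # Index of the candidate the reward model ranks highest (first one on ties).
        best = max(range(len(scores)), key=lambda j: scores[j])
        hits += int(labels[best])
    return hits / len(rm_scores)


# Toy data (made up): 2 problems, 8 sampled candidates each, hence n = 8.
toy_scores = [
    [0.1, 0.7, 0.3, 0.2, 0.9, 0.4, 0.5, 0.6],
    [0.8, 0.2, 0.1, 0.3, 0.4, 0.6, 0.5, 0.7],
]
toy_labels = [
    [False, True, False, False, True, False, False, True],
    [False, True, False, False, False, False, False, True],
]
print(rm_at_n(toy_scores, toy_labels))  # prints 0.5
```

In the toy data the reward model picks a correct candidate for the first problem and an incorrect one for the second, even though correct candidates exist for both. Under this reading, rm@8 measures how well the reward model ranks candidates as well as how good the generator's samples are.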
