The Lessons of Developing Process Reward Models in Mathematical Reasoning

Abstract

Process Reward Models (PRMs) have emerged as a promising approach to process supervision in the mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning process. However, developing effective PRMs faces significant challenges, particularly in data annotation and evaluation methodology. In this paper, through extensive experiments, we demonstrate that the commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, which leads to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. (2) The tolerance of PRMs toward such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of their minimum scores concentrated on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives, and we provide practical guidelines for future research in building process supervision models.
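The abstract does not spell out how the consensus filtering mechanism combines the two annotation sources. As a minimal sketch, assume a sample is kept only when MC estimation and the LLM judge agree on the position of the first erroneous step. The function names, the dictionary keys (`mc_rates`, `judge_first_error`), and the hard-label threshold of 0.0 are all hypothetical choices made for illustration, not details taken from the paper.

```python
from typing import List, Optional


def mc_first_error(step_success_rates: List[float],
                   threshold: float = 0.0) -> Optional[int]:
    """Return the index of the first step whose MC success rate is at or
    below `threshold`, or None if no step is flagged.

    Assumption: each rate is the fraction of rollouts continued from that
    step that reach the correct final answer.
    """
    for index, rate in enumerate(step_success_rates):
        if rate <= threshold:
            return index
    return None


def consensus_filter(samples: List[dict]) -> List[dict]:
    """Keep only samples where both annotators locate the same first error.

    Assumption: `judge_first_error` is the 0-based step index reported by
    an LLM judge, or None when the judge finds every step correct.
    """
    kept = []
    for sample in samples:
        mc_label = mc_first_error(sample["mc_rates"])
        if mc_label == sample["judge_first_error"]:
            kept.append(sample)
    return kept


# Hypothetical annotations for three responses.
samples = [
    {"id": "a", "mc_rates": [0.8, 0.5, 0.0], "judge_first_error": 2},
    {"id": "b", "mc_rates": [0.9, 0.0, 0.0], "judge_first_error": 2},
    {"id": "c", "mc_rates": [1.0, 0.6], "judge_first_error": None},
]
kept_ids = [s["id"] for s in consensus_filter(samples)]
```

In this sketch, response "b" is discarded because the two sources disagree on where the first error occurs, while "a" and "c" are retained. Requiring agreement trades dataset size for cleaner step labels, which is one plausible reading of the data-efficiency claim above.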
