AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence

Abstract

Current approaches for training Process Reward Models (PRMs) often break responses down into multiple reasoning steps using rule-based techniques, such as splitting on predefined placeholder tokens or fixing the length of each reasoning step. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division method provides more decision-making information at each step, enhancing downstream tasks such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs on mathematical reasoning and code generation tasks. Experimental results indicate that the resulting PRM achieves state-of-the-art Best-of-N performance, surpassing the greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study of the PRM's performance, transferability, and generalization capabilities.
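To make the idea of confidence-based division concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a per-token confidence (for example, the probability the model assigned to each sampled token) is already available, and it starts a new step at every token whose confidence falls below a threshold. The function name `split_by_confidence`, the default threshold of 0.5, and the choice to place the boundary before the low-confidence token are all illustrative assumptions; the abstract does not specify them.

```python
from typing import List, Sequence


def split_by_confidence(
    tokens: Sequence[str],
    confidences: Sequence[float],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[List[str]]:
    """Group tokens into steps, opening a new step at each low-confidence token."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")

    steps: List[List[str]] = []
    current: List[str] = []
    for token, conf in zip(tokens, confidences):
        # Assumption: a low-confidence token marks a decision point, so the
        # running step is closed and this token opens the next one.
        if conf < threshold and current:
            steps.append(current)
            current = []
        current.append(token)
    if current:
        steps.append(current)
    return steps


# Hypothetical tokens and confidences, made up for illustration only.
example_tokens = ["x", "=", "2", ",", "so", "y", "=", "4"]
example_confidences = [0.9, 0.95, 0.4, 0.9, 0.3, 0.8, 0.99, 0.6]
example_steps = split_by_confidence(example_tokens, example_confidences)
```

With these made-up values, the tokens "2" and "so" fall below the threshold, so the sequence is divided into three steps. In contrast to a rule-based split on a fixed token or a fixed length, the boundaries here depend only on where the model was uncertain.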
