Outcome-Refining Process Supervision for Code Generation

Abstract

Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. Process supervision through learned reward models shows promise in guiding reasoning steps, but it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Process Supervision, a novel paradigm that treats outcome refinement itself as the process to be supervised. Our framework uses concrete execution signals to ground the supervision of reasoning steps, and uses tree-structured exploration to maintain multiple solution trajectories simultaneously. Experiments demonstrate that our approach enables even smaller models to achieve high success rates and strong performance metrics on competitive programming tasks, while providing more reliable verification than traditional reward models without requiring trained process reward models (PRMs). Our approach achieves significant improvements across 5 models and 3 datasets: an average increase of 26.9% in correctness and 42.2% in efficiency. The results suggest that providing a structured reasoning space with concrete verification signals is crucial for solving complex programming tasks. We open-source all our code and data at: https://github.com/zhuohaoyu/ORPS
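The two ingredients named in the abstract, execution-grounded scoring and tree-structured exploration over several solution trajectories, can be illustrated with a small beam search. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `execution_score`, `beam_search`, and `toy_refine` are hypothetical, the language model is replaced by a canned lookup table, and the score is simply the fraction of test cases passed.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected result)


def execution_score(source: str, tests: List[TestCase]) -> float:
    """Execute a candidate and return the fraction of tests it passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # illustration only: no sandboxing here
        solve = namespace["solve"]
    except Exception:
        return 0.0  # code that fails to load earns no credit
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed / len(tests)


def beam_search(
    root: str,
    refine: Callable[[str], List[str]],
    tests: List[TestCase],
    beam_width: int = 2,
    max_depth: int = 3,
) -> Tuple[float, str]:
    """Keep the `beam_width` best candidates at each depth of the tree."""
    beam = [(execution_score(root, tests), root)]
    for _ in range(max_depth):
        if beam[0][0] == 1.0:
            break  # the best candidate already passes every test
        candidates = list(beam)
        for _, source in beam:
            for child in refine(source):
                candidates.append((execution_score(child, tests), child))
        # Stable sort: ties keep the order in which candidates were found.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Hypothetical toy task: return the sum of squares of a list.
ROOT = "def solve(xs):\n    return sum(xs)\n"
TESTS: List[TestCase] = [(([1, 2, 3],), 14), (([0],), 0), (([2],), 4)]

# Assumption: a fixed table stands in for model-proposed refinements.
CANNED = {
    ROOT: [
        "def solve(xs):\n    return sum(x * 2 for x in xs)\n",
        "def solve(xs):\n    return sum(x * x for x in xs)\n",
    ],
}


def toy_refine(source: str) -> List[str]:
    """Return the canned refinements of a candidate, if any."""
    return CANNED.get(source, [])


best_score, best_source = beam_search(ROOT, toy_refine, TESTS)
print(best_score)
print(best_source)
```

In this toy run the partially correct refinement stays in the beam alongside the fully correct one, which is the sense in which several trajectories are maintained at once. A real system would need sandboxed execution and a model-driven `refine` step.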

