Outcome-Refining Process Supervision for Code Generation
December 19, 2024
Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
cs.AI
Abstract
Large Language Models have demonstrated remarkable capabilities in code
generation, yet they often struggle with complex programming tasks that require
deep algorithmic reasoning. While process supervision through learned reward
models shows promise in guiding reasoning steps, it requires expensive training
data and suffers from unreliable evaluation. We propose Outcome-Refining
Process Supervision, a novel paradigm that treats outcome refinement itself as
the process to be supervised. Our framework leverages concrete execution
signals to ground the supervision of reasoning steps, while using
tree-structured exploration to maintain multiple solution trajectories
simultaneously. Experiments demonstrate that our approach enables even smaller
models to achieve high success accuracy and performance metrics on competitive
programming tasks, and creates more reliable verification than traditional reward
models without requiring the training of PRMs. Our approach achieves significant
improvements across 5 models and 3 datasets: an average of 26.9% increase in
correctness and 42.2% in efficiency. The results suggest that providing
structured reasoning space with concrete verification signals is crucial for
solving complex programming tasks. We open-source all our code and data at:
https://github.com/zhuohaoyu/ORPS
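
The abstract describes tree-structured exploration guided by concrete execution signals rather than a learned reward model. As an illustration only, the following is a minimal Python sketch of that idea under stated assumptions: candidate programs are executed against tests, scored by pass rate, and the best-scoring trajectories are kept and refined. The `Node`, `run_tests`, `refine`, and `explore` names are hypothetical placeholders, not the ORPS implementation; see the repository linked above for the actual code.

```python
# Minimal sketch (not the ORPS implementation) of tree-structured outcome
# refinement supervised by execution signals, as described in the abstract.
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in the exploration tree: a candidate program and its score."""
    code: str
    score: float = 0.0
    children: list = field(default_factory=list)


def run_tests(code, tests, timeout=2.0):
    """Concrete execution signal: fraction of (stdin, expected_stdout) tests passed."""
    passed = 0
    for stdin, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
            if proc.returncode == 0 and proc.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass
    return passed / max(len(tests), 1)


def refine(task, parent, feedback):
    """Hypothetical model call: propose refined candidate programs given the
    task, the parent solution, and execution feedback. Plug in an LLM here."""
    raise NotImplementedError("replace with your own model sampling")


def explore(task, tests, root, beam_width=3, depth=4):
    """Beam-style tree exploration: execution results, not a learned reward
    model, decide which solution trajectories to keep and refine further."""
    beam, best = [root], root
    for _ in range(depth):
        candidates = []
        for node in beam:
            feedback = f"pass rate: {node.score:.2f}"
            for child in refine(task, node, feedback):
                child.score = run_tests(child.code, tests)
                node.children.append(child)
                candidates.append(child)
        if not candidates:
            break
        candidates.sort(key=lambda n: n.score, reverse=True)
        beam = candidates[:beam_width]
        if beam[0].score > best.score:
            best = beam[0]
        if best.score == 1.0:  # all tests pass; outcome successfully refined
            break
    return best
```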