Outcome-Refining Process Supervision for Code Generation
December 19, 2024
Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
cs.AI
Abstract
Large Language Models have demonstrated remarkable capabilities in code
generation, yet they often struggle with complex programming tasks that require
deep algorithmic reasoning. While process supervision through learned reward
models shows promise in guiding reasoning steps, it requires expensive training
data and suffers from unreliable evaluation. We propose Outcome-Refining
Process Supervision, a novel paradigm that treats outcome refinement itself as
the process to be supervised. Our framework leverages concrete execution
signals to ground the supervision of reasoning steps, while using
tree-structured exploration to maintain multiple solution trajectories
simultaneously. Experiments demonstrate that our approach enables even smaller
models to achieve high success accuracy and performance metrics on competitive
programming tasks, and creates more reliable verification than traditional reward
models without requiring the training of PRMs. Our approach achieves significant
improvements across 5 models and 3 datasets: an average of 26.9% increase in
correctness and 42.2% in efficiency. The results suggest that providing
structured reasoning space with concrete verification signals is crucial for
solving complex programming tasks. We open-source all our code and data at:
https://github.com/zhuohaoyu/ORPS
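
The abstract describes tree-structured exploration guided by concrete execution signals rather than a learned reward model. As an illustration only, the following is a minimal Python sketch of that idea under stated assumptions: candidate programs are executed against tests, scored by pass rate, and the best-scoring trajectories are kept and refined. The `Node`, `run_tests`, `refine`, and `explore` names are hypothetical placeholders, not the ORPS implementation; see the repository linked above for the actual code.

```python
# Minimal sketch (not the ORPS implementation) of tree-structured outcome
# refinement supervised by execution signals, as described in the abstract.
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node in the exploration tree: a candidate program and its score."""
    code: str
    score: float = 0.0
    children: list = field(default_factory=list)


def run_tests(code, tests, timeout=2.0):
    """Concrete execution signal: fraction of (stdin, expected_stdout) tests passed."""
    passed = 0
    for stdin, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
            if proc.returncode == 0 and proc.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass
    return passed / max(len(tests), 1)


def refine(task, parent, feedback):
    """Hypothetical model call: propose refined candidate programs given the
    task, the parent solution, and execution feedback. Plug in an LLM here."""
    raise NotImplementedError("replace with your own model sampling")


def explore(task, tests, root, beam_width=3, depth=4):
    """Beam-style tree exploration: execution results, not a learned reward
    model, decide which solution trajectories to keep and refine further."""
    beam, best = [root], root
    for _ in range(depth):
        candidates = []
        for node in beam:
            feedback = f"pass rate: {node.score:.2f}"
            for child in refine(task, node, feedback):
                child.score = run_tests(child.code, tests)
                node.children.append(child)
                candidates.append(child)
        if not candidates:
            break
        candidates.sort(key=lambda n: n.score, reverse=True)
        beam = candidates[:beam_width]
        if beam[0].score > best.score:
            best = beam[0]
        if best.score == 1.0:  # all tests pass; outcome successfully refined
            break
    return best
```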