Outcome-Refining Process Supervision for Code Generation
December 19, 2024
Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
cs.AI
Abstract
Large Language Models have demonstrated remarkable capabilities in code
generation, yet they often struggle with complex programming tasks that require
deep algorithmic reasoning. While process supervision through learned reward
models shows promise in guiding reasoning steps, it requires expensive training
data and suffers from unreliable evaluation. We propose Outcome-Refining
Process Supervision, a novel paradigm that treats outcome refinement itself as
the process to be supervised. Our framework leverages concrete execution
signals to ground the supervision of reasoning steps, while using
tree-structured exploration to maintain multiple solution trajectories
simultaneously. Experiments demonstrate that our approach enables even smaller
models to achieve high success accuracy and performance metrics on competitive
programming tasks, and it creates more reliable verification than traditional
reward models without the need to train PRMs. Our approach achieves significant
improvements across 5 models and 3 datasets: an average of 26.9% increase in
correctness and 42.2% in efficiency. The results suggest that providing
a structured reasoning space with concrete verification signals is crucial for
solving complex programming tasks. We open-source all our code and data at:
https://github.com/zhuohaoyu/ORPS
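To make the mechanism described in the abstract concrete, the sketch below illustrates one plausible reading of the framework's loop: beam-style tree exploration over candidate programs, where each candidate is scored by concrete execution signals (here, the fraction of tests passed). This is a minimal sketch under stated assumptions, not the authors' implementation; the `llm` callable, the `solve` entry-point convention, and the beam/depth parameters are hypothetical stand-ins (see the linked repository for the real code).

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    """One point in the solution tree: a candidate program plus the
    self-critique that produced it, ordered by its execution score."""
    score: float
    program: str = field(compare=False)
    critique: str = field(compare=False)

def execution_score(program: str, tests) -> float:
    """Concrete execution signal: fraction of (args, expected) tests passed.
    Toy harness only -- a real system would sandbox untrusted code."""
    passed = 0
    for args, expected in tests:
        try:
            scope = {}
            exec(program, scope)              # program is assumed to define solve()
            if scope["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass                              # crashing candidates simply score zero
    return passed / len(tests)

def orps_search(llm, task, tests, beam_width=3, depth=4):
    """Beam-style tree search over outcome refinements.

    `llm(task, node)` is a hypothetical callable standing in for the model's
    generate-and-critique step: it returns a list of (program, critique)
    candidates, refining `node` (or proposing fresh solutions when node=None).
    """
    frontier = [Node(execution_score(p, tests), p, c) for p, c in llm(task, None)]
    for _ in range(depth):
        children = [Node(execution_score(p, tests), p, c)
                    for node in frontier
                    for p, c in llm(task, node)]
        # Keep several solution trajectories alive simultaneously.
        frontier = heapq.nlargest(beam_width, frontier + children)
        if frontier and frontier[0].score == 1.0:  # every test passes: stop early
            break
    return frontier[0].program
```

A caller would wrap a code-generation model so that `llm` returns refined (program, critique) pairs conditioned on the parent node's program, critique, and test outcomes; the execution score, rather than a learned reward model, then decides which branches of the tree survive.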