Outcome-Refining Process Supervision for Code Generation
December 19, 2024
Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
cs.AI
Abstract
Large Language Models have demonstrated remarkable capabilities in code
generation, yet they often struggle with complex programming tasks that require
deep algorithmic reasoning. While process supervision through learned reward
models shows promise in guiding reasoning steps, it requires expensive training
data and suffers from unreliable evaluation. We propose Outcome-Refining
Process Supervision, a novel paradigm that treats outcome refinement itself as
the process to be supervised. Our framework leverages concrete execution
signals to ground the supervision of reasoning steps, while using
tree-structured exploration to maintain multiple solution trajectories
simultaneously. Experiments demonstrate that our approach enables even smaller
models to achieve high success accuracy and performance metrics on competitive
programming tasks, and it creates more reliable verification than traditional
reward models without the need to train PRMs. Our approach achieves significant
improvements across 5 models and 3 datasets: an average of 26.9% increase in
correctness and 42.2% in efficiency. The results suggest that providing
a structured reasoning space with concrete verification signals is crucial for
solving complex programming tasks. We open-source all our code and data at:
https://github.com/zhuohaoyu/ORPS
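To make the mechanism described in the abstract concrete, the sketch below illustrates one plausible reading of the framework's loop: beam-style tree exploration over candidate programs, where each candidate is scored by concrete execution signals (here, the fraction of tests passed). This is a minimal sketch under stated assumptions, not the authors' implementation; the `llm` callable, the `solve` entry-point convention, and the beam/depth parameters are hypothetical stand-ins (see the linked repository for the real code).

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    """One point in the solution tree: a candidate program plus the
    self-critique that produced it, ordered by its execution score."""
    score: float
    program: str = field(compare=False)
    critique: str = field(compare=False)

def execution_score(program: str, tests) -> float:
    """Concrete execution signal: fraction of (args, expected) tests passed.
    Toy harness only -- a real system would sandbox untrusted code."""
    passed = 0
    for args, expected in tests:
        try:
            scope = {}
            exec(program, scope)              # program is assumed to define solve()
            if scope["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass                              # crashing candidates simply score zero
    return passed / len(tests)

def orps_search(llm, task, tests, beam_width=3, depth=4):
    """Beam-style tree search over outcome refinements.

    `llm(task, node)` is a hypothetical callable standing in for the model's
    generate-and-critique step: it returns a list of (program, critique)
    candidates, refining `node` (or proposing fresh solutions when node=None).
    """
    frontier = [Node(execution_score(p, tests), p, c) for p, c in llm(task, None)]
    for _ in range(depth):
        children = [Node(execution_score(p, tests), p, c)
                    for node in frontier
                    for p, c in llm(task, node)]
        # Keep several solution trajectories alive simultaneously.
        frontier = heapq.nlargest(beam_width, frontier + children)
        if frontier and frontier[0].score == 1.0:  # every test passes: stop early
            break
    return frontier[0].program
```

A caller would wrap a code-generation model so that `llm` returns refined (program, critique) pairs conditioned on the parent node's program, critique, and test outcomes; the execution score, rather than a learned reward model, then decides which branches of the tree survive.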