Iterative Self-Training for Code Generation via Reinforced Re-Ranking
April 13, 2025
Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
cs.AI
Abstract
Generating high-quality code that solves complex programming tasks is
challenging, especially with current decoder-based models that produce highly
stochastic outputs. In code generation, even minor errors can easily break the
entire solution. Leveraging multiple sampled solutions can significantly
improve the overall output quality.
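A minimal sketch of what "leveraging multiple sampled solutions" means in practice: sample several candidates from a stochastic generator and keep one that passes the task's unit tests. The generator here is a toy stand-in (the paper's actual models are large code LLMs), and all names are illustrative.

```python
import random

def sample_candidates(task, n=5, seed=0):
    # Hypothetical stand-in for a stochastic code LLM: returns n candidate
    # programs for the task, some of which may be broken.
    rng = random.Random(seed)
    correct = "def add(a, b):\n    return a + b"
    broken = "def add(a, b):\n    return a - b"
    return [correct if rng.random() > 0.5 else broken for _ in range(n)]

def passes_tests(src):
    # Execute the candidate in a fresh namespace and run a unit test.
    ns = {}
    try:
        exec(src, ns)
        return ns["add"](2, 3) == 5
    except Exception:
        return False

def best_of_n(task, n=5):
    # Return the first candidate that passes the tests; fall back to
    # an arbitrary sample if none pass.
    candidates = sample_candidates(task, n)
    for c in candidates:
        if passes_tests(c):
            return c
    return candidates[0]

solution = best_of_n("add two numbers", n=5)
```

Test-based selection like this needs executable tests at inference time; the reranker approach discussed next removes that requirement by learning to score candidates.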
One effective way to enhance code generation is by pairing a code generation
model with a reranker model, which selects the best solution from the generated
samples. We propose a novel iterative self-training approach for reranker
models using Proximal Policy Optimization (PPO), aimed at improving
both reranking accuracy and the overall code generation process. Unlike
traditional PPO approaches, where the focus is on optimizing a generative model
with a reward model, our approach emphasizes the development of a robust
reward/reranking model. This model improves the quality of generated code
through reranking and addresses problems and errors that the reward model might
overlook during PPO alignment with the reranker. Our method iteratively refines
the training dataset by re-evaluating outputs, identifying high-scoring
negative examples, and incorporating them into the training loop, thereby
boosting model performance.
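One round of the loop described above can be sketched as follows: score sampled solutions with the reranker, check them against tests, and collect candidates the reranker scored highly but that fail as hard negatives for the next training round. The scorer and candidates below are toy stand-ins, not the paper's models.

```python
def reranker_score(src):
    # Hypothetical stand-in for the learned reranker; this toy heuristic
    # prefers shorter programs, so it can be fooled by wrong code.
    return 1.0 / (1.0 + len(src))

def run_tests(src):
    # Functional-correctness check used to label candidates.
    ns = {}
    try:
        exec(src, ns)
        return ns["sq"](4) == 16
    except Exception:
        return False

candidates = [
    "def sq(x):\n    return x * x",  # correct
    "def sq(x):\n    return x",      # wrong, but shorter => scored higher
]

# Rank candidates by reranker score, best first.
ranked = sorted(candidates, key=reranker_score, reverse=True)
best = ranked[0]

# High-scoring failures: candidates the reranker rated at least as highly
# as its top pick, yet which fail the tests. These become hard negatives
# in the next reranker training round.
hard_negatives = [c for c in ranked
                  if not run_tests(c)
                  and reranker_score(c) >= reranker_score(best)]
```

Here the reranker's top pick is actually wrong, so it lands in `hard_negatives`; retraining on such examples is what the iterative refinement targets.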
Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter
model outperforms a 33B model in code generation quality while being three
times faster. Moreover, it achieves performance comparable to GPT-4 and
surpasses it in one programming language.