Iterative Self-Training for Code Generation via Reinforced Re-Ranking
April 13, 2025
Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
cs.AI
Abstract
Generating high-quality code that solves complex programming tasks is
challenging, especially with current decoder-based models that produce highly
stochastic outputs. In code generation, even minor errors can easily break the
entire solution. Leveraging multiple sampled solutions can significantly
improve the overall output quality.
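As a rough illustration of what "multiple sampled solutions" looks like in practice, the sketch below draws several stochastic completions from a decoder-only code model with Hugging Face transformers. It is a minimal example under our own assumptions: the checkpoint name, prompt, and sampling parameters are illustrative placeholders, not the models or settings used in the paper.

```python
# Minimal sketch: sample many candidate solutions from a decoder-only code model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder2-3b"  # illustrative checkpoint, not the paper's model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "def remove_duplicates(items: list) -> list:\n"
    '    """Return items with duplicates removed, preserving order."""\n'
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Temperature sampling produces diverse candidates; a reranker (see below)
# can then select the most promising one instead of trusting a single sample.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=20,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```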
One effective way to enhance code generation is by pairing a code generation
model with a reranker model, which selects the best solution from the generated
samples. We propose a novel iterative self-training approach for reranker
models using Proximal Policy Optimization (PPO), aimed at improving
both reranking accuracy and the overall code generation process. Unlike
traditional PPO approaches, where the focus is on optimizing a generative model
with a reward model, our approach emphasizes the development of a robust
reward/reranking model. This model improves the quality of generated code
through reranking and addresses problems and errors that the reward model might
overlook during PPO alignment with the reranker. Our method iteratively refines
the training dataset by re-evaluating outputs, identifying high-scoring
negative examples, and incorporating them into the training loop, thereby
boosting model performance.
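The sketch below shows, under our own assumptions, the control flow of such an iterative refinement loop: candidates are checked against unit tests and scored by the reranker, high-scoring failures are collected as hard negatives, and the reranker is then re-aligned. The helpers generate_candidates, run_unit_tests, reranker_score, and ppo_update_reranker are hypothetical placeholders, not the authors' released code.

```python
# Hedged sketch of the iterative dataset-refinement loop described above.
# All helper callables are assumed placeholders; only the control flow is shown.

def refine_reranker(tasks, generator, reranker,
                    num_iterations=3, n_samples=20, score_threshold=0.8):
    training_data = []
    for _ in range(num_iterations):
        for task in tasks:
            candidates = generate_candidates(generator, task.prompt, n=n_samples)
            for code in candidates:
                passed = run_unit_tests(code, task.tests)          # ground-truth signal
                score = reranker_score(reranker, task.prompt, code)
                if score >= score_threshold and not passed:
                    # High-scoring negative: the reranker is confident, but the
                    # solution fails its tests -- exactly the kind of error the
                    # reward model might otherwise overlook.
                    training_data.append((task.prompt, code, 0.0))
                elif passed:
                    training_data.append((task.prompt, code, 1.0))
        # Re-align the reranker on the refreshed dataset (PPO step, kept abstract here).
        reranker = ppo_update_reranker(reranker, training_data)
    return reranker
```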
Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter
model outperforms a 33B model in code generation quality while being three
times faster. Moreover, it achieves performance comparable to GPT-4 and
surpasses it in one programming language.