IterPref: Focal Preference Learning for Code Generation via Iterative Debugging
March 4, 2025
Authors: Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li
cs.AI
Abstract
Preference learning enhances Code LLMs beyond supervised fine-tuning by leveraging relative quality comparisons. Existing methods construct preference pairs from candidate programs based on test-case success, treating the sample with the higher pass rate as positive and the one with the lower rate as negative. However, this approach does not pinpoint the specific errors in the code, which prevents the model from learning more informative error-correction patterns: aligning failing code as a whole lacks the granularity needed to capture meaningful error-resolution relationships. To address these issues, we propose IterPref, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. IterPref explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To generate informative pairs, we introduce the CodeFlow dataset, where samples are iteratively refined until they pass the tests, with the modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with IterPref achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that IterPref yields fewer errors. Our code and data will be made publicly available.
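As a rough illustration of the "focal" idea in the abstract, the sketch below restricts a standard DPO contrast to the tokens that differ between a failing program and its corrected version. This is a minimal sketch under assumed inputs (per-token log-probabilities and hypothetical diff masks `mask_w` / `mask_l`), not the tailored DPO algorithm described in the paper.

```python
# Illustrative only: a DPO-style loss computed over a masked (focal) token
# region rather than the whole sequence. Inputs and names are assumptions.
import torch
import torch.nn.functional as F

def masked_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    mask_w, mask_l, beta=0.1):
    """logp_* / ref_logp_*: per-token log-probs of shape (batch, seq_len)
    under the policy and the frozen reference model; mask_* is 1 on the
    tokens that changed between the failing and the corrected program."""
    # Sum log-probabilities only over the focal (masked) tokens.
    pi_w = (logp_w * mask_w).sum(-1)
    pi_l = (logp_l * mask_l).sum(-1)
    ref_w = (ref_logp_w * mask_w).sum(-1)
    ref_l = (ref_logp_l * mask_l).sum(-1)
    # Standard DPO contrast, applied to the masked sums.
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()
```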