Evaluating and Aligning CodeLLMs on Human Preference

December 6, 2024
Authors: Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI

Abstract

Code large language models (code LLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, where queries should be sampled from practical application scenarios and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus (nearly 20B tokens) built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments on CodeArena with 40+ LLMs reveal a notable performance gap between open-source SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. \url{https://codearenaeval.github.io/}
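
To illustrate how a preference-based benchmark like CodeArena differs from execution-based ones, the sketch below scores a candidate model against a baseline with pairwise LLM-judge comparisons instead of unit tests. This is a minimal sketch only: the prompt wording, the sample fields, and the helpers (build_judge_prompt, call_judge_llm) are assumptions for illustration, not the paper's released data or evaluation pipeline.

```python
# Minimal sketch of an arena-style, preference-based evaluation loop.
# All data, fields, and helpers below are hypothetical illustrations,
# not the official CodeArena release or evaluation code.
from typing import Dict, List

JUDGE_PROMPT = """You are judging two answers to a real-world coding request.

Request:
{query}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer better satisfies the user's intent (correctness, clarity,
completeness)? Reply with exactly "A", "B", or "Tie"."""


def build_judge_prompt(sample: Dict[str, str], candidate: str, baseline: str) -> str:
    """Format a pairwise comparison prompt for an LLM judge."""
    return JUDGE_PROMPT.format(
        query=sample["query"], answer_a=candidate, answer_b=baseline
    )


def win_rate(verdicts: List[str]) -> float:
    """Fraction of comparisons the candidate wins; ties count as half a win."""
    if not verdicts:
        return 0.0
    score = sum(1.0 if v == "A" else 0.5 if v == "Tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


if __name__ == "__main__":
    # Hypothetical queries standing in for benchmark samples drawn from
    # real user requests rather than programming-exercise test suites.
    samples = [
        {"query": "Write a Python function that deduplicates a list while keeping order."},
        {"query": "Explain and fix the data race in this Go snippet: ..."},
    ]
    verdicts: List[str] = []
    for sample in samples:
        candidate = "..."  # response from the model under evaluation
        baseline = "..."   # response from a fixed reference model
        prompt = build_judge_prompt(sample, candidate, baseline)
        # verdicts.append(call_judge_llm(prompt))  # plug in any LLM judge here
    print(f"win rate vs. baseline: {win_rate(verdicts):.2%}")
```

Because the verdict comes from a preference judgment rather than test execution, a response can pass every unit test yet still lose the comparison on clarity or completeness, which is the gap the abstract highlights between execution-based benchmarks and CodeArena.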
