Evaluating and Aligning CodeLLMs on Human Preference

December 6, 2024
Authors: Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI

Abstract

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, where queries should be sampled from practical application scenarios and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preferences, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Furthermore, we propose SynCode-Instruct, a diverse synthetic instruction corpus of nearly 20B tokens built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic evaluation of 40+ LLMs on CodeArena uncovers a notable performance gap between open-source SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. https://codearenaeval.github.io/
