Evaluating and Aligning CodeLLMs on Human Preference
December 6, 2024
Authors: Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI
Abstract
Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with corresponding test cases, are used as a common measure to evaluate the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, where queries should be sampled from practical application scenarios and model-generated responses should satisfy human preferences. To bridge the gap between model-generated responses and human preferences, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, comprising 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus of nearly 20B tokens built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments on CodeArena across 40+ LLMs show a notable performance gap between open-source SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. https://codearenaeval.github.io/
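As a rough illustration of the preference-based evaluation setting described above (judging free-form responses rather than executing hidden test cases), the sketch below loads CodeArena-style samples and formats them as prompts for a model under evaluation. This is a minimal sketch only: the JSONL layout and the field names "question", "category", and "language" are assumptions for illustration, not the paper's actual data format.

import json

# Minimal sketch, assuming a hypothetical JSONL layout with "question",
# "category", and "language" fields; the real CodeArena format may differ.
def load_samples(path):
    """Read one benchmark sample per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def build_prompt(sample):
    """Turn a user query into a chat-style prompt for a code LLM."""
    return (f"Task ({sample['language']}, {sample['category']}):\n"
            f"{sample['question']}")

if __name__ == "__main__":
    for sample in load_samples("codearena_samples.jsonl"):
        prompt = build_prompt(sample)
        # Send `prompt` to the model under evaluation and store the free-form
        # response; responses would then be judged for human preference (e.g.,
        # by pairwise comparison) instead of by unit-test execution.
        print(prompt)

The design contrast is the point: execution-based benchmarks score pass/fail on test cases, whereas a preference-style benchmark like CodeArena evaluates whether the full response satisfies the user's real-world query.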