CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
October 2, 2024
Authors: Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui
cs.AI
Abstract
Recent advancements in Code Large Language Models (CodeLLMs) have
predominantly focused on open-ended code generation tasks, often neglecting the
critical aspect of code understanding and comprehension. To bridge this gap, we
present CodeMMLU, a comprehensive multiple-choice question-answer benchmark
designed to evaluate the depth of software and code understanding in LLMs.
CodeMMLU includes over 10,000 questions sourced from diverse domains,
encompassing tasks such as code analysis, defect detection, and software
engineering principles across multiple programming languages. Unlike
traditional benchmarks, CodeMMLU assesses models' ability to reason about code
rather than merely generate it, providing deeper insights into their grasp of
complex software concepts and systems. Our extensive evaluation reveals that
even state-of-the-art models face significant challenges with CodeMMLU,
highlighting deficiencies in comprehension beyond code generation. By
underscoring the crucial relationship between code understanding and effective
generation, CodeMMLU serves as a vital resource for advancing AI-assisted
software development, ultimately aiming to create more reliable and capable
coding assistants.
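To make the multiple-choice evaluation format concrete, the following is a minimal sketch of how such an item might be represented and scored by exact-match accuracy. The MCQAItem fields, the accuracy helper, and the toy defect-detection question are illustrative assumptions, not the actual CodeMMLU schema or data.

```python
# Illustrative sketch only: field names and the example item are hypothetical,
# not the actual CodeMMLU schema.
from dataclasses import dataclass


@dataclass
class MCQAItem:
    question: str        # prompt text, typically including a code snippet
    choices: list[str]   # candidate answers (A, B, C, ...)
    answer: int          # index of the correct choice


def accuracy(items: list[MCQAItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted choice index matches the key."""
    if not items:
        return 0.0
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)


# A toy defect-detection style question (made up for illustration).
item = MCQAItem(
    question=(
        "What is wrong with this function?\n"
        "def mean(xs):\n"
        "    return sum(xs) / len(xs)\n"
    ),
    choices=[
        "It raises ZeroDivisionError on an empty list",
        "It returns an int for float inputs",
        "It mutates its argument",
        "Nothing is wrong",
    ],
    answer=0,
)

print(accuracy([item], [0]))  # 1.0
```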