CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
October 2, 2024
Authors: Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui
cs.AI
Abstract
Recent advancements in Code Large Language Models (CodeLLMs) have
predominantly focused on open-ended code generation tasks, often neglecting the
critical aspect of code understanding and comprehension. To bridge this gap, we
present CodeMMLU, a comprehensive multiple-choice question-answer benchmark
designed to evaluate the depth of software and code understanding in LLMs.
CodeMMLU includes over 10,000 questions sourced from diverse domains,
encompassing tasks such as code analysis, defect detection, and software
engineering principles across multiple programming languages. Unlike
traditional benchmarks, CodeMMLU assesses models' ability to reason about code
rather than merely generate it, providing deeper insights into their grasp of
complex software concepts and systems. Our extensive evaluation reveals that
even state-of-the-art models face significant challenges with CodeMMLU,
highlighting deficiencies in comprehension beyond code generation. By
underscoring the crucial relationship between code understanding and effective
generation, CodeMMLU serves as a vital resource for advancing AI-assisted
software development, ultimately aiming to create more reliable and capable
coding assistants.
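To make the multiple-choice evaluation format concrete, the following is a minimal sketch of how such an item might be represented and scored by exact-match accuracy. The MCQAItem fields, the accuracy helper, and the toy defect-detection question are illustrative assumptions, not the actual CodeMMLU schema or data.

```python
# Illustrative sketch only: field names and the example item are hypothetical,
# not the actual CodeMMLU schema.
from dataclasses import dataclass


@dataclass
class MCQAItem:
    question: str        # prompt text, typically including a code snippet
    choices: list[str]   # candidate answers (A, B, C, ...)
    answer: int          # index of the correct choice


def accuracy(items: list[MCQAItem], predictions: list[int]) -> float:
    """Fraction of items where the predicted choice index matches the key."""
    if not items:
        return 0.0
    correct = sum(pred == item.answer for item, pred in zip(items, predictions))
    return correct / len(items)


# A toy defect-detection style question (made up for illustration).
item = MCQAItem(
    question=(
        "What is wrong with this function?\n"
        "def mean(xs):\n"
        "    return sum(xs) / len(xs)\n"
    ),
    choices=[
        "It raises ZeroDivisionError on an empty list",
        "It returns an int for float inputs",
        "It mutates its argument",
        "Nothing is wrong",
    ],
    answer=0,
)

print(accuracy([item], [0]))  # 1.0
```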