CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

Abstract

Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.

