CodeArena: A Collective Evaluation Platform for LLM Code Generation
March 3, 2025
Authors: Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng
cs.AI
Abstract
Large Language Models (LLMs) have reshaped code generation by synergizing
their exceptional comprehension of natural language and programming syntax,
thereby substantially boosting developer productivity. These advancements have
prompted numerous efforts to quantitatively evaluate their coding capabilities.
However, persistent challenges, such as benchmark leakage, data dissipation,
and limited system accessibility, continue to impede a timely and accurate
assessment. To address these limitations, we introduce CodeArena, an online
evaluation framework tailored for LLM code generation. The key innovation is a
collective evaluation mechanism, which dynamically recalibrates individual
model scores based on the holistic performance of all participating models,
mitigating score biases caused by widespread benchmark leakage. In addition,
CodeArena ensures open access to all submitted solutions and test cases and
provides automation-friendly APIs to streamline the code evaluation workflow.
Our main contributions are: (1) a collective evaluation system for unbiased
assessment, (2) a public repository of solutions and test cases, and (3)
automation-ready APIs for seamless integration.
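The abstract does not spell out the recalibration formula behind the collective evaluation mechanism. As a rough illustration only, the minimal sketch below assumes a simple normalization of each model's raw score against the cohort average, so that uniform inflation from leaked benchmark data largely cancels out; the function name and normalization rule are hypothetical and not taken from the paper.

```python
from typing import Dict


def recalibrate_scores(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical sketch of collective recalibration: rescale each model's
    raw benchmark score against the performance of all participating models.
    If every model's score is inflated by benchmark leakage, the relative
    score is less affected. The actual CodeArena formula is not given in
    the abstract."""
    cohort_mean = sum(raw_scores.values()) / len(raw_scores)
    return {model: score / cohort_mean for model, score in raw_scores.items()}


if __name__ == "__main__":
    # Example: three models with raw pass rates on the same benchmark.
    scores = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61}
    print(recalibrate_scores(scores))
```

The key design point the abstract emphasizes is that a model's score is not fixed at submission time: it is dynamically re-derived from the full pool of participating models, so adding new models can shift existing scores.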