CodeArena: A Collective Evaluation Platform for LLM Code Generation
March 3, 2025
Authors: Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng
cs.AI
Abstract
Large Language Models (LLMs) have reshaped code generation by synergizing
their exceptional comprehension of natural language and programming syntax,
thereby substantially boosting developer productivity. These advancements have
prompted numerous efforts to quantitatively evaluate their coding capabilities.
However, persistent challenges, such as benchmark leakage, data dissipation,
and limited system accessibility, continue to impede a timely and accurate
assessment. To address these limitations, we introduce CodeArena, an online
evaluation framework tailored for LLM code generation. The key innovation is a
collective evaluation mechanism, which dynamically recalibrates individual
model scores based on the holistic performance of all participating models,
mitigating score biases caused by widespread benchmark leakage. In addition,
CodeArena ensures open access to all submitted solutions and test cases and
provides automation-friendly APIs to streamline the code evaluation workflow.
Our main contributions are: (1) a collective evaluation system for unbiased
assessment, (2) a public repository of solutions and test cases, and (3)
automation-ready APIs for seamless integration.
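The abstract does not spell out the recalibration formula behind the collective evaluation mechanism. As a rough illustration only, the minimal sketch below assumes a simple normalization of each model's raw score against the cohort average, so that uniform inflation from leaked benchmark data largely cancels out; the function name and normalization rule are hypothetical and not taken from the paper.

```python
from typing import Dict


def recalibrate_scores(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Hypothetical sketch of collective recalibration: rescale each model's
    raw benchmark score against the performance of all participating models.
    If every model's score is inflated by benchmark leakage, the relative
    score is less affected. The actual CodeArena formula is not given in
    the abstract."""
    cohort_mean = sum(raw_scores.values()) / len(raw_scores)
    return {model: score / cohort_mean for model, score in raw_scores.items()}


if __name__ == "__main__":
    # Example: three models with raw pass rates on the same benchmark.
    scores = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61}
    print(recalibrate_scores(scores))
```

The key design point the abstract emphasizes is that a model's score is not fixed at submission time: it is dynamically re-derived from the full pool of participating models, so adding new models can shift existing scores.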