CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
January 2, 2025
Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI
Abstract
With the increasing code reasoning capabilities of existing large language
models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3,
there is a growing need to develop more challenging and comprehensive
benchmarks that effectively test their sophisticated competition-level coding
abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to
the unavailability of private test cases, lack of support for special judges,
and misaligned execution environments. To bridge this gap, we introduce
CodeElo, a standardized competition-level code generation benchmark that
effectively addresses all these challenges for the first time. The CodeElo
benchmark is built mainly on the official CodeForces platform and aligns with
it as closely as possible. We compile problems from the most recent six months
of CodeForces contests, together with detailed information such as contest
divisions, problem difficulty ratings, and problem algorithm tags. We introduce
a unique judging method in which solutions are submitted directly to the
platform for evaluation, and we develop a reliable Elo rating calculation
system that aligns with the platform's own ratings, is directly comparable
with those of human participants, and has lower variance.
Testing on CodeElo, we provide, for the first time, Elo ratings for 30 popular
open-source LLMs and 3 proprietary LLMs. The results show that
o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of
1578 and 1261, respectively, while other models struggle even with the easiest
problems, placing in the lowest 20 percent among all human participants.
Detailed analysis experiments are also conducted to provide insights into
performance across algorithms and comparisons between using C++ and Python,
which can suggest directions for future studies.
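For context on how a CodeForces-style Elo rating can be derived from a single contest result, the sketch below shows one common approach: the standard Elo win-probability formula yields an expected rank for a candidate rating, and the rating is obtained by bisection until the expected rank matches the rank the model actually achieved. This is only an illustrative sketch; `expected_rank`, `estimate_rating`, the 400-point scale, and the search bounds are assumptions for illustration, not the paper's exact implementation.

```python
from typing import List


def expected_rank(rating: float, opponent_ratings: List[float]) -> float:
    """Expected rank of a contestant with the given rating, using the standard
    Elo win probability P(opponent wins) = 1 / (1 + 10 ** ((rating - r_opp) / 400))."""
    return 1.0 + sum(
        1.0 / (1.0 + 10.0 ** ((rating - r_opp) / 400.0))
        for r_opp in opponent_ratings
    )


def estimate_rating(actual_rank: float, opponent_ratings: List[float],
                    lo: float = 0.0, hi: float = 4000.0) -> float:
    """Bisect for the rating whose expected rank equals the rank actually
    achieved; expected_rank decreases monotonically as the rating grows."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > actual_rank:
            lo = mid  # expected rank worse than achieved -> rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical contest: opponents rated 1200, 1500, and 1800; the model placed 2nd.
print(round(estimate_rating(actual_rank=2, opponent_ratings=[1200, 1500, 1800])))
```

In this hypothetical example the bisection settles around the high 1600s, i.e., above the 1500-rated opponent the model tied conceptually and below what a first-place finish would imply.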