CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
January 2, 2025
Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI
Abstract
With the increasing code reasoning capabilities of existing large language
models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3,
there is a growing need to develop more challenging and comprehensive
benchmarks that effectively test their sophisticated competition-level coding
abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to
the unavailability of private test cases, lack of support for special judges,
and misaligned execution environments. To bridge this gap, we introduce
CodeElo, a standardized competition-level code generation benchmark that
effectively addresses all these challenges for the first time. The CodeElo
benchmark is built mainly on the official CodeForces platform and aligns with
it as closely as possible. We compile problems from the most recent six months
of CodeForces contests, together with detailed information such as contest
divisions, problem difficulty ratings, and problem algorithm tags. We introduce
a unique judging method in which solutions are submitted directly to the
platform for evaluation, and we develop a reliable Elo rating calculation
system that aligns with the platform's own ratings, is directly comparable
with those of human participants, and has lower variance.
Testing on CodeElo, we provide, for the first time, Elo ratings for 30 popular
open-source LLMs and 3 proprietary LLMs. The results show that
o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of
1578 and 1261, respectively, while other models struggle even with the easiest
problems, placing in the lowest 20 percent among all human participants.
Detailed analysis experiments are also conducted to provide insights into
performance across algorithms and comparisons between using C++ and Python,
which can suggest directions for future studies.
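For context on how a CodeForces-style Elo rating can be derived from a single contest result, the sketch below shows one common approach: the standard Elo win-probability formula yields an expected rank for a candidate rating, and the rating is obtained by bisection until the expected rank matches the rank the model actually achieved. This is only an illustrative sketch; `expected_rank`, `estimate_rating`, the 400-point scale, and the search bounds are assumptions for illustration, not the paper's exact implementation.

```python
from typing import List


def expected_rank(rating: float, opponent_ratings: List[float]) -> float:
    """Expected rank of a contestant with the given rating, using the standard
    Elo win probability P(opponent wins) = 1 / (1 + 10 ** ((rating - r_opp) / 400))."""
    return 1.0 + sum(
        1.0 / (1.0 + 10.0 ** ((rating - r_opp) / 400.0))
        for r_opp in opponent_ratings
    )


def estimate_rating(actual_rank: float, opponent_ratings: List[float],
                    lo: float = 0.0, hi: float = 4000.0) -> float:
    """Bisect for the rating whose expected rank equals the rank actually
    achieved; expected_rank decreases monotonically as the rating grows."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > actual_rank:
            lo = mid  # expected rank worse than achieved -> rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical contest: opponents rated 1200, 1500, and 1800; the model placed 2nd.
print(round(estimate_rating(actual_rank=2, opponent_ratings=[1200, 1500, 1800])))
```

In this hypothetical example the bisection settles around the high 1600s, i.e., above the 1500-rated opponent the model tied conceptually and below what a first-place finish would imply.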