CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
January 2, 2025
Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI
Abstract
With the increasing code reasoning capabilities of existing large language
models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3,
there is a growing need to develop more challenging and comprehensive
benchmarks that effectively test their sophisticated competition-level coding
abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to
the unavailability of private test cases, lack of support for special judges,
and misaligned execution environments. To bridge this gap, we introduce
CodeElo, a standardized competition-level code generation benchmark that
effectively addresses all these challenges for the first time. The CodeElo
benchmark is built mainly on the official CodeForces platform and aligns with
the platform as closely as possible. We compile the most recent six months
of contest problems on CodeForces with detailed information such as contest
divisions, problem difficulty ratings, and problem algorithm tags. We introduce
a unique judging method in which generated solutions are submitted directly to
the platform for evaluation, and we develop a reliable Elo rating calculation
system that aligns with the platform, is directly comparable with the ratings
of human participants, and exhibits lower variance.
By testing on our CodeElo, we provide the Elo ratings of 30 existing popular
open-source and 3 proprietary LLMs for the first time. The results show that
o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of
1578 and 1261, respectively, while other models struggle even with the easiest
problems, placing in the lowest 20 percent among all human participants.
Detailed analysis experiments are also conducted to provide insights into
performance across algorithms and comparisons between using C++ and Python,
which can suggest directions for future studies.
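For context on the rating scale referenced above, Codeforces-style Elo ratings rest on the standard logistic expected-score formula, E = 1 / (1 + 10^((R_problem - R_participant) / 400)). The sketch below is a minimal illustration of how a rating could be inferred from pass/fail results on problems of known difficulty; it assumes a simple moment-matching bisection and hypothetical function names, and is not the paper's actual rating calculation.

```python
# Illustrative sketch only (not CodeElo's exact method): estimate an Elo-style
# rating from pass/fail results on problems with known difficulty ratings,
# using the standard logistic expectation E = 1 / (1 + 10^((R_p - R) / 400)).
from typing import List, Tuple


def expected_solve_prob(participant_rating: float, problem_rating: float) -> float:
    """Probability that a participant with `participant_rating` solves a
    problem rated `problem_rating`, under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((problem_rating - participant_rating) / 400.0))


def estimate_rating(results: List[Tuple[float, bool]],
                    lo: float = 0.0, hi: float = 4000.0) -> float:
    """Bisect for the rating at which the expected number of solves matches
    the observed number. `results` is a list of (problem_rating, solved)."""
    solved = sum(1 for _, ok in results if ok)
    for _ in range(100):  # expected solves is monotone in rating, so bisection works
        mid = (lo + hi) / 2.0
        expected = sum(expected_solve_prob(mid, r) for r, _ in results)
        if expected < solved:
            lo = mid  # rating guess too low
        else:
            hi = mid  # rating guess too high
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical results on four problems with CodeForces-style difficulty ratings.
    demo = [(800, True), (1200, True), (1600, False), (2000, False)]
    print(f"estimated Elo-style rating: {estimate_rating(demo):.0f}")
```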