CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
January 2, 2025
Authors: Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin
cs.AI
Abstract
With the increasing code reasoning capabilities of existing large language
models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3,
there is a growing need to develop more challenging and comprehensive
benchmarks that effectively test their sophisticated competition-level coding
abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to
the unavailability of private test cases, lack of support for special judges,
and misaligned execution environments. To bridge this gap, we introduce
CodeElo, a standardized competition-level code generation benchmark that
effectively addresses all these challenges for the first time. The CodeElo
benchmark is built mainly on the official CodeForces platform and aligns with
the platform as closely as possible. We compile the most recent six months
of contest problems on CodeForces with detailed information such as contest
divisions, problem difficulty ratings, and problem algorithm tags. We introduce
a unique judging method in which generated solutions are submitted directly to
the platform for evaluation, and we develop a reliable Elo rating calculation
system that aligns with the platform, is directly comparable with the ratings
of human participants, and exhibits lower variance.
By testing on our CodeElo, we provide the Elo ratings of 30 existing popular
open-source and 3 proprietary LLMs for the first time. The results show that
o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of
1578 and 1261, respectively, while other models struggle even with the easiest
problems, placing in the lowest 20 percent among all human participants.
Detailed analysis experiments are also conducted to provide insights into
performance across algorithms and comparisons between using C++ and Python,
which can suggest directions for future studies.
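For context on the rating scale referenced above, Codeforces-style Elo ratings rest on the standard logistic expected-score formula, E = 1 / (1 + 10^((R_problem - R_participant) / 400)). The sketch below is a minimal illustration of how a rating could be inferred from pass/fail results on problems of known difficulty; it assumes a simple moment-matching bisection and hypothetical function names, and is not the paper's actual rating calculation.

```python
# Illustrative sketch only (not CodeElo's exact method): estimate an Elo-style
# rating from pass/fail results on problems with known difficulty ratings,
# using the standard logistic expectation E = 1 / (1 + 10^((R_p - R) / 400)).
from typing import List, Tuple


def expected_solve_prob(participant_rating: float, problem_rating: float) -> float:
    """Probability that a participant with `participant_rating` solves a
    problem rated `problem_rating`, under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((problem_rating - participant_rating) / 400.0))


def estimate_rating(results: List[Tuple[float, bool]],
                    lo: float = 0.0, hi: float = 4000.0) -> float:
    """Bisect for the rating at which the expected number of solves matches
    the observed number. `results` is a list of (problem_rating, solved)."""
    solved = sum(1 for _, ok in results if ok)
    for _ in range(100):  # expected solves is monotone in rating, so bisection works
        mid = (lo + hi) / 2.0
        expected = sum(expected_solve_prob(mid, r) for r, _ in results)
        if expected < solved:
            lo = mid  # rating guess too low
        else:
            hi = mid  # rating guess too high
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Hypothetical results on four problems with CodeForces-style difficulty ratings.
    demo = [(800, True), (1200, True), (1600, False), (2000, False)]
    print(f"estimated Elo-style rating: {estimate_rating(demo):.0f}")
```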