
LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

April 20, 2025
Authors: Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu
cs.AI

Abstract

We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and Github.
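The contamination-free evaluation rests on the temporal split described above: problems released before July 2024 form the training pool, and later problems form the held-out evaluation set, so solutions memorized during pretraining cannot leak into the benchmark. A minimal sketch of such a split is shown below; the record fields (`slug`, `release_date`, `difficulty`) are assumptions for illustration, not the dataset's actual schema.

```python
from datetime import date

# Hypothetical problem records; the real LeetCodeDataset (on Hugging Face)
# attaches richer metadata, including a per-problem release date.
problems = [
    {"slug": "two-sum", "release_date": date(2015, 8, 1), "difficulty": "Easy"},
    {"slug": "count-paths", "release_date": date(2023, 3, 12), "difficulty": "Medium"},
    {"slug": "fresh-problem", "release_date": date(2024, 9, 15), "difficulty": "Hard"},
]

# Cutoff from the paper: pre/post July 2024.
CUTOFF = date(2024, 7, 1)

# Problems released before the cutoff are safe for SFT; problems released
# after it are unlikely to appear in pretraining corpora, so they serve as
# a contamination-free evaluation set.
train_split = [p for p in problems if p["release_date"] < CUTOFF]
eval_split = [p for p in problems if p["release_date"] >= CUTOFF]
```

The same date-based filter generalizes to any dataset whose records carry a release timestamp; the key design choice is picking a cutoff recent enough that post-cutoff problems postdate the evaluated models' training data.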

