

LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

April 20, 2025
Authors: Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu
cs.AI

Abstract

We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds. By curating LeetCode Python problems with rich metadata, broad coverage, 100+ test cases per problem, and temporal splits (pre/post July 2024), our dataset enables contamination-free evaluation and efficient supervised fine-tuning (SFT). Experiments show reasoning models significantly outperform non-reasoning counterparts, while SFT with only 2.6K model-generated solutions achieves performance comparable to 110K-sample counterparts. The dataset and evaluation framework are available on Hugging Face and GitHub.
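The temporal split described above (problems released before July 2024 for training, after for contamination-free evaluation) can be sketched as a simple date filter. This is an illustrative reconstruction, not the dataset's actual loading code; the field names (`slug`, `release_date`) are assumed placeholders and may differ from LeetCodeDataset's real schema.

```python
from datetime import date

# Hypothetical problem records with per-problem release dates.
problems = [
    {"slug": "two-sum", "release_date": date(2015, 8, 7)},
    {"slug": "count-substrings", "release_date": date(2024, 9, 15)},
    {"slug": "minimum-operations", "release_date": date(2025, 1, 3)},
]

# Cutoff from the paper: pre/post July 2024.
CUTOFF = date(2024, 7, 1)

# Problems released before the cutoff can be used for SFT;
# later problems form the held-out evaluation split, which
# post-cutoff models are unlikely to have seen in pretraining.
train_split = [p for p in problems if p["release_date"] < CUTOFF]
eval_split = [p for p in problems if p["release_date"] >= CUTOFF]

print(len(train_split), len(eval_split))  # → 1 2
```

The key design point is that the split key is the problem's public release date rather than a random shuffle, so any model whose pretraining data predates the cutoff cannot have memorized the evaluation problems.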

