KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

Abstract

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data, across diverse difficulties and domains, for training large language models to code. Existing code-focused resources typically fail to ensure both breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) and verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based rejection sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. KodCode is suitable for supervised fine-tuning, and the paired unit tests also provide great potential for reinforcement learning (RL) tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models such as Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
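The self-verification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `passes_tests`, `self_verify`, and `toy_generate` are hypothetical, the generator is a stand-in for a model call, and the tests are assumed to be plain `assert` statements run in-process without sandboxing. The idea shown is only that a solution-test pair is kept when the solution passes its tests, and that a question gets a bounded number of extra attempts before it is dropped.

```python
from typing import Callable, Optional, Tuple

# A candidate is a (solution source, test source) pair; this layout is an assumption.
Candidate = Tuple[str, str]


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Execute the solution, then the tests, in one scratch namespace.

    Any exception (including a failed assert) counts as a failure.
    No sandboxing is done here; a real pipeline would isolate execution.
    """
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # defines the solution function(s)
        exec(test_src, namespace)      # bare asserts raise on failure
    except Exception:
        return False
    return True


def self_verify(
    question: str,
    generate: Callable[[str, int], Candidate],
    max_attempts: int = 3,
) -> Optional[Candidate]:
    """Return the first candidate whose solution passes its tests, else None."""
    for attempt in range(max_attempts):
        solution_src, test_src = generate(question, attempt)
        if passes_tests(solution_src, test_src):
            return solution_src, test_src
    return None  # question is rejected after exhausting its attempt budget


def toy_generate(question: str, attempt: int) -> Candidate:
    """Hypothetical generator: the first attempt is buggy, later ones are correct."""
    tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
    if attempt == 0:
        return "def add(a, b):\n    return a - b", tests
    return "def add(a, b):\n    return a + b", tests


result = self_verify("Write add(a, b).", toy_generate, max_attempts=3)
rejected = self_verify("Write add(a, b).", toy_generate, max_attempts=1)
```

In this toy run the first attempt fails its tests and the second is accepted, while a budget of one attempt rejects the question. Raising `max_attempts` for harder questions corresponds to the "additional attempts" the abstract mentions; the same accept-or-reject check could serve as the filter in test-based rejection sampling.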
