ACECODER: Acing Coder RL via Automated Test-Case Synthesis
February 3, 2025
Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen
cs.AI
Abstract
Most progress in recent coder models has been driven by supervised
fine-tuning (SFT), while the potential of reinforcement learning (RL) remains
largely unexplored, primarily due to the lack of reliable reward data/model in
the code domain. In this paper, we address this challenge by leveraging
automated large-scale test-case synthesis to enhance code model training.
Specifically, we design a pipeline that generates extensive (question,
test-cases) pairs from existing code data. Using these test cases, we construct
preference pairs based on pass rates over sampled programs to train reward
models with Bradley-Terry loss. This yields an average 10-point improvement for
Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins under
best-of-32 sampling, putting the 7B model on par with the 236B DeepSeek-V2.5.
Furthermore, we conduct reinforcement learning with both reward models and
test-case pass rewards, leading to consistent improvements across HumanEval,
MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style
training, starting directly from Qwen2.5-Coder-base, and show that our RL
training can improve the model by over 25% on HumanEval-plus and 6% on
MBPP-plus within merely 80 optimization steps. We believe our results
highlight the huge potential of reinforcement learning in coder models.
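The recipe the abstract describes, building preference pairs from test-case pass rates, training a reward model with the Bradley-Terry loss, and reranking via best-of-n sampling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the pass-rate margin for pairing, and the scalar reward interface are all hypothetical.

```python
import math

def build_preference_pairs(programs, pass_rates, margin=0.5):
    """Pair up sampled programs whose test-case pass rates differ by at
    least `margin`; the higher-pass-rate program is the 'chosen' one.
    (The margin threshold is an assumption for illustration.)"""
    pairs = []
    for prog_i, rate_i in zip(programs, pass_rates):
        for prog_j, rate_j in zip(programs, pass_rates):
            if rate_i - rate_j >= margin:
                pairs.append((prog_i, prog_j))  # (chosen, rejected)
    return pairs

def bradley_terry_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected),
    averaged over preference pairs."""
    total = 0.0
    for rc, rr in zip(r_chosen, r_rejected):
        total += -math.log(1.0 / (1.0 + math.exp(-(rc - rr))))
    return total / len(r_chosen)

def best_of_n(programs, reward_fn):
    """Best-of-n reranking: return the candidate program that the
    (assumed scalar-valued) reward model scores highest."""
    return max(programs, key=reward_fn)
```

The Bradley-Terry objective drives the reward model to score the higher-pass-rate program above the lower-pass-rate one: the loss shrinks as the reward margin between chosen and rejected grows.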