ACECODER: Acing Coder RL via Automated Test-Case Synthesis
February 3, 2025
Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen
cs.AI
Abstract
Most recent progress in coder models has been driven by supervised
fine-tuning (SFT), while the potential of reinforcement learning (RL) remains
largely unexplored, primarily due to the lack of reliable reward data and models in
the code domain. In this paper, we address this challenge by leveraging
automated large-scale test-case synthesis to enhance code model training.
Specifically, we design a pipeline that generates extensive (question,
test-cases) pairs from existing code data. Using these test cases, we construct
preference pairs based on pass rates over sampled programs to train reward
models with Bradley-Terry loss. With best-of-32 sampling, these reward models
yield an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point
improvement for Qwen2.5-Coder-7B-Ins, putting the 7B model on par with the
236B DeepSeek-V2.5.
Furthermore, we conduct reinforcement learning with both reward models and
test-case pass rewards, leading to consistent improvements across HumanEval,
MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style
training to start directly from Qwen2.5-Coder-base and show that our RL
training improves the model on HumanEval-plus by over 25% and on MBPP-plus by
6% in merely 80 optimization steps. We believe our results highlight the huge
potential of reinforcement learning for coder models.
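To make the described pipeline concrete, below is a minimal PyTorch sketch of the reward-model ingredients the abstract names: preference pairs constructed from test-case pass rates, the Bradley-Terry loss L = -log σ(r_chosen - r_rejected), and best-of-n selection. The function names, the pass-rate gap threshold, and the toy data are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed implementation, not the authors' code): build
# Bradley-Terry preference pairs from test-case pass rates, compute the
# pairwise reward-model loss, and apply best-of-n selection.
import torch
import torch.nn.functional as F


def build_preference_pairs(programs, pass_rates, min_gap=0.4):
    """Pair sampled programs whose pass rates differ by at least `min_gap`.

    programs:   candidate solutions sampled for one question
    pass_rates: fraction of synthesized test cases each program passes
    min_gap:    illustrative threshold; the paper's actual rule may differ
    """
    pairs = []
    for i, hi in enumerate(pass_rates):
        for j, lo in enumerate(pass_rates):
            if hi - lo >= min_gap:
                pairs.append((programs[i], programs[j]))  # (chosen, rejected)
    return pairs


def bradley_terry_loss(r_chosen, r_rejected):
    """Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def best_of_n(programs, reward_scores):
    """Best-of-n selection: keep the program the reward model ranks highest."""
    best = max(range(len(programs)), key=lambda i: reward_scores[i])
    return programs[best]


# Toy usage with made-up scores; real scores come from the trained reward model.
progs = ["def f(x): return x + 1", "def f(x): return x", "def f(x): return 2*x"]
rates = [1.0, 0.3, 0.6]
print(build_preference_pairs(progs, rates))
print(bradley_terry_loss(torch.tensor([1.3, 0.8]), torch.tensor([0.2, -0.5])))
print(best_of_n(progs, [0.9, 0.1, 0.4]))
```

Thresholding on the pass-rate gap is one plausible way to keep only confidently ordered pairs; the paper may use a different pairing rule.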