Dynamic Scaling of Unit Tests for Code Reward Modeling

January 2, 2025
Authors: Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
cs.AI

Abstract

Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of the unit tests serve as reward signals to identify correct solutions. Because LLMs can make mistakes with high confidence, these unit tests are not fully reliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our preliminary experiments reveal a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed on more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
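To make the selection mechanism described above concrete, below is a minimal sketch of using unit-test execution results as reward signals for best-of-N candidate selection. This is an illustrative toy, not the paper's CodeRM-8B implementation: the candidates, tests, and the `score_candidates` helper are hypothetical, and in practice both candidates and tests would be sampled from an LLM rather than hard-coded.

```python
# A minimal sketch (assumptions, not the paper's code) of scoring candidate
# solutions by how many LLM-generated unit tests they pass, then selecting
# the highest-scoring candidate (best-of-N with a unit-test reward signal).

from typing import Callable, List

def score_candidates(
    candidates: List[Callable[[int], int]],
    unit_tests: List[Callable[[Callable[[int], int]], bool]],
) -> List[int]:
    """Reward for each candidate = number of unit tests it passes.

    A single LLM-generated test is a noisy reward signal (the test itself
    may be wrong); scaling the number of tests averages out that noise.
    """
    scores = []
    for solve in candidates:
        passed = 0
        for test in unit_tests:
            try:
                if test(solve):
                    passed += 1
            except Exception:
                pass  # a crashing test counts as a failure
        scores.append(passed)
    return scores

# Toy task: "return the square of x". One candidate is correct, one buggy.
candidates = [
    lambda x: x * x,   # correct
    lambda x: x + x,   # buggy (agrees with x*x only at x = 0 and x = 2)
]

# Hypothetical LLM-generated tests: mostly right, one faulty, mimicking
# the unreliable reward signals discussed in the abstract.
unit_tests = [
    lambda f: f(3) == 9,
    lambda f: f(4) == 16,
    lambda f: f(2) == 4,   # passes for both candidates: uninformative
    lambda f: f(5) == 10,  # faulty expected value; favors the buggy candidate
]

scores = score_candidates(candidates, unit_tests)
best = max(range(len(candidates)), key=lambda i: scores[i])
print(f"pass counts: {scores}, selected candidate index: {best}")
# -> pass counts: [3, 2], selected candidate index: 0
```

With only the faulty test, the buggy candidate would win; with four tests, the correct candidate does, illustrating why scaling the number of tests improves reward quality. The paper's dynamic scaling mechanism goes further by spending more tests on problems estimated to be harder, where the abstract reports the gains from scaling are largest.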
