Dynamic Scaling of Unit Tests for Code Reward Modeling
January 2, 2025
Authors: Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
cs.AI
Abstract
Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of the unit tests serve as reward signals to identify correct solutions. Because LLMs often make mistakes with high confidence, these unit tests are not reliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our preliminary experiments reveal a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed on more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient, high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests to problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
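The abstract describes two mechanisms: selecting the correct candidate by the execution results of many generated unit tests, and adapting the number of tests to problem difficulty. The following is a minimal Python sketch of both ideas, not the paper's implementation; the helper names (`select_best_solution`, `num_tests_for`, `run_test`) and the linear difficulty schedule are illustrative assumptions.

```python
from typing import Callable, List


def select_best_solution(
    solutions: List[str],
    unit_tests: List[str],
    run_test: Callable[[str, str], bool],
) -> str:
    """Return the candidate that passes the most unit tests.

    The pass count acts as the reward signal: any single generated
    test may be wrong, so scaling the number of tests makes the
    aggregate signal more reliable.
    """
    def reward(solution: str) -> int:
        return sum(run_test(solution, test) for test in unit_tests)

    return max(solutions, key=reward)


def num_tests_for(difficulty: float, n_min: int = 10, n_max: int = 200) -> int:
    """Dynamic scaling: allocate more unit tests to harder problems.

    `difficulty` in [0, 1] is assumed to come from some external
    estimator; the linear schedule here is an illustrative choice,
    not the paper's mechanism.
    """
    return round(n_min + difficulty * (n_max - n_min))


if __name__ == "__main__":
    # Toy demo: two candidate implementations of `add`, one buggy.
    solutions = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    return a - b",  # buggy candidate
    ]
    tests = [
        "assert add(1, 2) == 3",
        "assert add(0, 0) == 0",
        "assert add(-1, 1) == 0",
    ]

    def run_test(solution: str, test: str) -> bool:
        # Execute candidate and test in a fresh namespace; a real
        # system would sandbox this rather than call exec directly.
        env: dict = {}
        try:
            exec(solution, env)
            exec(test, env)
            return True
        except Exception:
            return False

    print(select_best_solution(solutions, tests, run_test))  # prints the correct `add`
    print(num_tests_for(0.8))  # 162 tests for a hard problem
```

In the paper's setting, the unit tests would be produced by the CodeRM-8B generator and executed in a sandbox; the `exec`-based runner above is for demonstration only.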