Teaching Language Models to Critique via Reinforcement Learning
February 5, 2025
Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
cs.AI
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is
crucial for building systems that can iteratively improve, yet it is
fundamentally limited by their ability to provide accurate judgments and
actionable suggestions. In this work, we study LLM critics for code generation
and propose CTRL, a framework for Critic
Training via Reinforcement Learning, which
trains a critic model to generate feedback that maximizes correction
performance for a fixed generator model without human supervision. Our results
demonstrate that critics trained with CTRL significantly enhance
pass rates and mitigate compounding errors across both base and stronger
generator models. Furthermore, we show that these critic models act as accurate
generative reward models and enable test-time scaling through iterative
critique-revision, achieving up to 106.1% relative improvements across
challenging code generation benchmarks.
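To make the iterative critique-revision procedure described in the abstract concrete, below is a minimal Python sketch of the test-time loop. All names (`critique_revise`, `generator_fn`, `critic_fn`, `passes_tests`) are hypothetical illustrations, not the authors' implementation: the loop alternates generation, critique, and revision with a frozen generator until the candidate solution passes an execution-based check or an iteration budget runs out.

```python
from typing import Callable

def critique_revise(
    task: str,
    generator_fn: Callable[[str], str],    # frozen generator model (hypothetical stub)
    critic_fn: Callable[[str, str], str],  # trained critic model (hypothetical stub)
    passes_tests: Callable[[str], bool],   # execution-based check, e.g. unit tests
    max_rounds: int = 4,
) -> str:
    """Iterative critique-revision: the critic's feedback steers the
    fixed generator toward a solution that passes the tests."""
    solution = generator_fn(task)
    for _ in range(max_rounds):
        if passes_tests(solution):
            # Correction succeeded; in CTRL-style training this same
            # pass/fail outcome would serve as the critic's reward,
            # requiring no human supervision.
            break
        feedback = critic_fn(task, solution)
        # The generator revises conditioned on its prior attempt plus
        # the critic's judgment and actionable suggestions.
        revision_prompt = (
            f"{task}\n\nPrevious attempt:\n{solution}\n\n"
            f"Critique:\n{feedback}\n\nRevised solution:"
        )
        solution = generator_fn(revision_prompt)
    return solution
```

The key design point this sketch illustrates is that the generator stays fixed: only the critic is trained, and its feedback is judged by whether the resulting revision passes, which is what lets the same critic be reused as a generative reward model and scaled at test time through repeated rounds.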