Teaching Language Models to Critique via Reinforcement Learning
February 5, 2025
Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
cs.AI
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is
crucial for building systems that can iteratively improve, yet it is
fundamentally limited by their ability to provide accurate judgments and
actionable suggestions. In this work, we study LLM critics for code generation
and propose CTRL, a framework for Critic
Training via Reinforcement Learning, which
trains a critic model to generate feedback that maximizes correction
performance for a fixed generator model without human supervision. Our results
demonstrate that critics trained with CTRL significantly enhance
pass rates and mitigate compounding errors across both base and stronger
generator models. Furthermore, we show that these critic models act as accurate
generative reward models and enable test-time scaling through iterative
critique-revision, achieving up to 106.1% relative improvements across
challenging code generation benchmarks.
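To make the iterative critique-revision procedure described in the abstract concrete, below is a minimal Python sketch of the test-time loop. All names (`critique_revise`, `generator_fn`, `critic_fn`, `passes_tests`) are hypothetical illustrations, not the authors' implementation: the loop alternates generation, critique, and revision with a frozen generator until the candidate solution passes an execution-based check or an iteration budget runs out.

```python
from typing import Callable

def critique_revise(
    task: str,
    generator_fn: Callable[[str], str],    # frozen generator model (hypothetical stub)
    critic_fn: Callable[[str, str], str],  # trained critic model (hypothetical stub)
    passes_tests: Callable[[str], bool],   # execution-based check, e.g. unit tests
    max_rounds: int = 4,
) -> str:
    """Iterative critique-revision: the critic's feedback steers the
    fixed generator toward a solution that passes the tests."""
    solution = generator_fn(task)
    for _ in range(max_rounds):
        if passes_tests(solution):
            # Correction succeeded; in CTRL-style training this same
            # pass/fail outcome would serve as the critic's reward,
            # requiring no human supervision.
            break
        feedback = critic_fn(task, solution)
        # The generator revises conditioned on its prior attempt plus
        # the critic's judgment and actionable suggestions.
        revision_prompt = (
            f"{task}\n\nPrevious attempt:\n{solution}\n\n"
            f"Critique:\n{feedback}\n\nRevised solution:"
        )
        solution = generator_fn(revision_prompt)
    return solution
```

The key design point this sketch illustrates is that the generator stays fixed: only the critic is trained, and its feedback is judged by whether the resulting revision passes, which is what lets the same critic be reused as a generative reward model and scaled at test time through repeated rounds.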