Teaching Language Models to Critique via Reinforcement Learning

Abstract

Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving relative improvements of up to 106.1% across challenging code generation benchmarks.
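To make the critique-revision idea concrete, the following is a minimal sketch of the loop the abstract describes, not the authors' implementation. Every name in it (`Critique`, `toy_generator`, `toy_critic`, `critique_revise`, `critic_reward`) is hypothetical, and the two "models" are hard-coded stand-ins for an LLM generator and an LLM critic. The reward function reflects one plausible reading of "feedback that maximizes correction performance": the critic is rewarded when the fixed generator's revision passes the tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Solution = Callable[[int], int]
Tests = List[Tuple[int, int]]  # (input, expected output) pairs


@dataclass
class Critique:
    is_correct: bool  # the critic's judgment of the current solution
    suggestion: str   # actionable hint passed back to the generator


def passes(solution: Solution, tests: Tests) -> bool:
    """Return True if the solution produces the expected output on every test."""
    return all(solution(x) == want for x, want in tests)


def toy_generator(previous: Optional[Solution],
                  critique: Optional[Critique]) -> Solution:
    """Hypothetical stand-in for the fixed generator model (task: absolute value)."""
    if critique is not None and "negative" in critique.suggestion:
        return lambda x: -x if x < 0 else x  # revision that follows the hint
    return lambda x: x  # first draft: wrong for negative inputs


def toy_critic(solution: Solution) -> Critique:
    """Hypothetical stand-in for a trained critic; it probes one input itself."""
    if solution(-3) != 3:
        return Critique(False, "handle negative inputs")
    return Critique(True, "")


def critique_revise(generator, critic, max_rounds: int = 3):
    """Iterate critique and revision until the critic accepts or rounds run out."""
    solution = generator(None, None)
    revisions = 0
    for _ in range(max_rounds):
        critique = critic(solution)
        if critique.is_correct:
            break
        solution = generator(solution, critique)
        revisions += 1
    return solution, revisions


def critic_reward(generator, critique: Critique,
                  draft: Solution, tests: Tests) -> float:
    """Assumed reward: 1.0 if the generator's revision under this critique passes."""
    revised = generator(draft, critique)
    return 1.0 if passes(revised, tests) else 0.0


TESTS: Tests = [(2, 2), (-3, 3), (0, 0)]
final_solution, num_revisions = critique_revise(toy_generator, toy_critic)
```

In this toy setup, a specific critique ("handle negative inputs") earns the critic a reward because the generator can act on it, while a vague one ("looks wrong") does not. That contrast is the intuition behind training a critic for actionable feedback rather than for judgment alone.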
