Training Language Models to Self-Correct via Reinforcement Learning

Abstract

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
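The reward bonus mentioned above can be illustrated with a small sketch. The abstract does not give the bonus formula, so the form used here (the second-attempt reward plus a multiple of the improvement over the first attempt) is an assumption made for illustration. The names `TwoTurnRollout`, `shaped_second_turn_reward`, and `alpha` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TwoTurnRollout:
    """Correctness rewards for one prompt: a first attempt, then a revision."""
    first_reward: float   # e.g. 1.0 if the first answer is correct, else 0.0
    second_reward: float  # the same check applied to the revised answer


def shaped_second_turn_reward(rollout: TwoTurnRollout, alpha: float = 2.0) -> float:
    """Return the second-attempt reward plus a bonus for improving on attempt one.

    ASSUMPTION: the bonus form alpha * (second - first) and the default alpha
    are illustrative choices; the abstract only says a reward bonus is used.
    """
    bonus = alpha * (rollout.second_reward - rollout.first_reward)
    return rollout.second_reward + bonus


# The four possible outcomes under a binary correctness reward.
fixed = shaped_second_turn_reward(TwoTurnRollout(0.0, 1.0))      # wrong -> right
kept = shaped_second_turn_reward(TwoTurnRollout(1.0, 1.0))       # right -> right
broke = shaped_second_turn_reward(TwoTurnRollout(1.0, 0.0))      # right -> wrong
stayed_bad = shaped_second_turn_reward(TwoTurnRollout(0.0, 0.0)) # wrong -> wrong
```

Under this assumed shaping, a rollout that turns a wrong answer into a right one scores higher than one that was right both times, and a rollout that breaks a correct answer is penalized. This is one way a bonus could favor genuine correction over simply reproducing a high-reward first answer.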

