Training Language Models to Self-Correct via Reinforcement Learning
September 19, 2024
Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust
cs.AI
Abstract
Self-correction is a highly desirable capability of large language models
(LLMs), yet it has consistently been found to be largely ineffective in modern
LLMs. Existing approaches for training self-correction either require multiple
models or rely on a more capable model or other forms of supervision. To this
end, we develop a multi-turn online reinforcement learning (RL) approach,
SCoRe, that significantly improves an LLM's self-correction ability using
entirely self-generated data. To build SCoRe, we first show that variants of
supervised fine-tuning (SFT) on offline model-generated correction traces are
insufficient for instilling self-correction behavior. In particular, we observe
that training via SFT either suffers from a distribution mismatch between the
training data and the model's own responses or implicitly prefers only a
certain mode of correction behavior that is often not effective at test time.
SCoRe addresses these challenges by training under the model's own distribution
of self-generated correction traces and using appropriate regularization to
steer the learning process into learning a self-correction strategy that is
effective at test time as opposed to simply fitting high-reward responses for a
given prompt. This regularization prescribes running a first phase of RL on a
base model to generate a policy initialization that is less susceptible to
collapse and then using a reward bonus to amplify self-correction during
training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that
SCoRe achieves state-of-the-art self-correction performance, improving the base
models' self-correction by 15.6% and 9.1% respectively on the MATH and
HumanEval benchmarks.
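
To make the abstract's reward-bonus idea concrete, below is a minimal sketch of how a second-turn reward could be shaped to amplify self-correction. This is not the authors' implementation: the two-turn episode, the 0/1 correctness reward, and the bonus weight `alpha` are illustrative assumptions based only on the description above.

```python
# Minimal sketch (illustrative, not the paper's code) of reward shaping that
# amplifies self-correction in a two-turn episode: the second turn receives a
# bonus for improving on the first attempt, pushing the policy toward learning
# to correct itself rather than merely producing a good first answer.

def shaped_rewards(correct_turn1: bool, correct_turn2: bool, alpha: float = 2.0):
    """Return (turn-1 reward, shaped turn-2 reward) for one episode.

    Assumes a 0/1 correctness reward per turn; `alpha` (hypothetical) scales
    the bonus for the change in correctness between the two attempts.
    """
    r1 = float(correct_turn1)
    r2 = float(correct_turn2)
    bonus = alpha * (r2 - r1)  # positive when the revision fixes a mistake
    return r1, r2 + bonus


if __name__ == "__main__":
    # An episode that self-corrects (wrong -> right) earns a larger shaped
    # second-turn reward than one that was already right on both turns.
    print(shaped_rewards(False, True))   # (0.0, 3.0)
    print(shaped_rewards(True, True))    # (1.0, 1.0)
    # Degrading a correct first answer is penalized on the second turn.
    print(shaped_rewards(True, False))   # (1.0, -2.0)
```

Under this kind of shaping, a policy maximizing the second-turn return prefers trajectories whose revision improves on the first attempt, which is the role the abstract assigns to the reward bonus during training.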