

Taming Overconfidence in LLMs: Reward Calibration in RLHF

October 13, 2024
Authors: Jixuan Leng, Chengsong Huang, Banghua Zhu, Jiaxin Huang
cs.AI

Abstract

Language model calibration refers to the alignment between the confidence of a model and the actual performance of its responses. Previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident, with more sharply peaked output probabilities. In this study, we reveal that RLHF also tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit an inherent bias towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M (PPO with Calibrated Reward Modeling) and PPO-C (PPO with Calibrated Reward Calculation). PPO-M integrates explicit confidence scores into reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and a moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional gold labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets, including multiple-choice and open-ended generation. Experimental results demonstrate that both of our methods reduce calibration error and maintain performance comparable to standard PPO. We further show that they do not compromise model capabilities in open-ended conversation settings.
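
To make the PPO-C idea concrete, below is a minimal Python sketch of a calibrated reward calculation based only on the abstract's description. The class name CalibratedRewardCalculator, the alpha and beta parameters, and the specific rule that scales the adjustment by a verbalized confidence score are illustrative assumptions for exposition, not the authors' implementation.

class CalibratedRewardCalculator:
    """Hypothetical PPO-C-style reward adjustment, sketched from the abstract."""

    def __init__(self, beta: float = 0.9, alpha: float = 1.0):
        self.beta = beta        # smoothing factor for the moving average of past rewards
        self.alpha = alpha      # strength of the calibration adjustment (assumed)
        self.moving_avg = 0.0
        self.initialized = False

    def adjust(self, reward: float, verbalized_confidence: float) -> float:
        """Return a calibrated reward for one rollout.

        reward: raw score from the reward model.
        verbalized_confidence: model-stated confidence in [0, 1].
        """
        if not self.initialized:
            self.moving_avg = reward
            self.initialized = True
        # Deviation of the current reward from the running average of past rewards.
        delta = reward - self.moving_avg
        # Assumed rule: confidence increases the reward only when the response scores
        # above the running average, and is penalized on below-average responses.
        calibrated = reward + self.alpha * delta * (verbalized_confidence - 0.5)
        # Update the moving average after computing the adjustment.
        self.moving_avg = self.beta * self.moving_avg + (1 - self.beta) * reward
        return calibrated

In this sketch, a high verbalized confidence only pays off when the reward model already scores the response above its running average, which mirrors the abstract's goal of aligning verbalized confidence with response quality without requiring extra gold labels.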
