Taming Overconfidence in LLMs: Reward Calibration in RLHF

Abstract

Language model calibration refers to the alignment between a model's confidence and the actual performance of its responses. Previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident, with a more sharpened output probability. In this study, we reveal that RLHF tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit inherent biases towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M (PPO with Calibrated Reward Modeling) and PPO-C (PPO with Calibrated Reward Calculation). PPO-M integrates explicit confidence scores in reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and the moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional golden labels. We evaluate our methods on both Llama3-8B and Mistral-7B across six diverse datasets, including multiple-choice and open-ended generation. Experimental results demonstrate that both methods reduce calibration error while maintaining performance comparable to standard PPO. We further show that they do not compromise model capabilities in open-ended conversation settings.
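The two quantitative ideas in the abstract, calibration error and PPO-C's reward adjustment, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract states only that PPO-C adjusts the reward using the difference between the current reward and a moving average of past rewards. The exponential moving average, the `alpha` and `weight` parameters, the `(confidence - 0.5)` scaling, and the names `MovingAverageReward` and `expected_calibration_error` are all assumptions made for illustration. The calibration metric shown is the standard binned expected calibration error, which may differ from the exact metric used in the paper.

```python
from dataclasses import dataclass


@dataclass
class MovingAverageReward:
    """Hypothetical helper: adjusts rewards against a running average of past rewards."""
    alpha: float = 0.1   # assumed smoothing factor for the moving average
    weight: float = 1.0  # assumed scale of the adjustment
    mean: float = 0.0    # running average of past rewards

    def adjust(self, reward: float, confidence: float) -> float:
        """Return an adjusted reward for a response with verbalized confidence in [0, 1]."""
        # Difference between the current reward and the moving average of past rewards.
        delta = reward - self.mean
        # Assumption: above-average responses gain from high stated confidence,
        # below-average responses gain from low stated confidence.
        adjusted = reward + self.weight * delta * (confidence - 0.5)
        # Update the running average only after the adjustment is computed.
        self.mean = self.alpha * reward + (1 - self.alpha) * self.mean
        return adjusted


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between stated confidence and accuracy."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(avg_conf - accuracy)
    return ece
```

Under these assumptions, a response whose reward is above the running average receives a larger adjusted reward when it states high confidence, and a response below the average receives a larger adjusted reward when it states low confidence. A model that always states 0.9 confidence but is correct half the time has a calibration error of about 0.4.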
