Reward-Robust RLHF in LLMs
September 18, 2024
Authors: Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, Yuan Shen
cs.AI
Abstract
As Large Language Models (LLMs) continue to progress toward more advanced
forms of intelligence, Reinforcement Learning from Human Feedback (RLHF) is
increasingly seen as a key pathway toward achieving Artificial General
Intelligence (AGI). However, the reliance on reward-model-based (RM-based)
alignment methods introduces significant challenges due to the inherent
instability and imperfections of Reward Models (RMs), which can lead to
critical issues such as reward hacking and misalignment with human intentions.
In this paper, we introduce a reward-robust RLHF framework aimed at addressing
these fundamental challenges, paving the way for more reliable and resilient
learning in LLMs. Our approach introduces a novel optimization objective that
carefully balances performance and robustness by incorporating Bayesian Reward
Model Ensembles (BRME) to model the uncertainty set of reward functions. This
allows the framework to integrate both nominal performance and minimum reward
signals, ensuring more stable learning even with imperfect reward models.
Empirical results demonstrate that our framework consistently outperforms
traditional RLHF across diverse benchmarks, showing improved accuracy and
long-term stability. We also provide a theoretical analysis demonstrating that
reward-robust RLHF approaches the stability of constant-reward settings, which
proves effective in a stochastic-case analysis. Together, these contributions
highlight the framework's potential to enhance both the performance and
stability of LLM alignment with RLHF.
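
The core idea of the objective described above can be illustrated with a minimal sketch: an ensemble of reward models scores each response, and the nominal (mean) signal is blended with the worst-case (minimum) signal before being passed to a standard RLHF policy update. The function name `robust_reward`, the trade-off coefficient `lam`, and the tensor shapes are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of a reward-robust objective: blend the ensemble's
# nominal (mean) reward with its worst-case (minimum) reward. Names, shapes,
# and the coefficient `lam` are illustrative assumptions, not the paper's
# exact BRME formulation.
import torch


def robust_reward(ensemble_rewards: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Blend nominal and minimum reward signals from a reward-model ensemble.

    Args:
        ensemble_rewards: tensor of shape (num_models, batch_size), each row
            holding one ensemble member's scalar reward per response.
        lam: trade-off between nominal performance (lam=1) and robustness (lam=0).

    Returns:
        Tensor of shape (batch_size,) with the blended reward per response.
    """
    nominal = ensemble_rewards.mean(dim=0)            # average reward across the ensemble
    worst_case = ensemble_rewards.min(dim=0).values   # pessimistic lower bound
    return lam * nominal + (1.0 - lam) * worst_case


# Example: 4 reward models scoring a batch of 3 responses.
scores = torch.tensor([[0.9, 0.2, 0.5],
                       [0.8, 0.1, 0.6],
                       [0.7, 0.3, 0.4],
                       [0.9, 0.0, 0.5]])
print(robust_reward(scores, lam=0.5))  # blended rewards feed a PPO-style RLHF update
```

Setting `lam` closer to 1 recovers ordinary mean-ensemble RLHF, while values closer to 0 emphasize the minimum reward signal, trading some nominal performance for robustness to imperfect reward models.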