Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
January 6, 2025
作者: Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has been widely adopted to
align language models (LMs) with human preference. Prior RLHF works typically
take a bandit formulation, which, though intuitive, ignores the sequential
nature of LM generation and can suffer from the sparse reward issue. While
recent works propose dense token-level RLHF, treating each token as an action
may be too subtle for proper reward assignment. In this paper, we seek to get
the best of both by training and utilizing a segment-level reward model, which
assigns a reward to each semantically complete text segment that spans over a
short sequence of tokens. For reward learning, our method allows dynamic text
segmentation and compatibility with standard sequence-preference datasets. For
effective RL-based LM training against segment reward, we generalize the
classical scalar bandit reward normalizers into location-aware normalizer
functions and interpolate the segment reward for further densification. With
these designs, our method performs competitively on three popular RLHF
benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation
studies are conducted to further demonstrate our method.
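To make the abstract's key mechanics more concrete, below is a minimal sketch of how segment-level rewards might be normalized with position-dependent statistics and then spread over the tokens of each segment for densification. This is not the paper's implementation: the function densify_segment_rewards, the equal-split interpolation within each segment, and the placeholder normalizers mu and sigma are illustrative assumptions only.

```python
import numpy as np

def densify_segment_rewards(segments, seg_rewards, norm_mean, norm_std, eps=1e-8):
    """Normalize segment-level rewards with location-aware statistics and
    spread each one over its segment's tokens (equal-split interpolation).

    segments    : contiguous (start, end) token-index pairs, one per text segment
    seg_rewards : scalar reward for each segment from a segment-level reward model
    norm_mean   : callable mu(position), a location-aware mean normalizer
    norm_std    : callable sigma(position), a location-aware std normalizer
    """
    n_tokens = segments[-1][1]           # assumes segments tile the response [0, n)
    token_rewards = np.zeros(n_tokens)
    for (start, end), r in zip(segments, seg_rewards):
        # location-aware normalization, anchored at the segment's start position
        r_norm = (r - norm_mean(start)) / (norm_std(start) + eps)
        # densification: share the normalized segment reward across its tokens
        token_rewards[start:end] = r_norm / (end - start)
    return token_rewards

# Hypothetical usage: three segments over a 10-token response
segments = [(0, 3), (3, 7), (7, 10)]
seg_rewards = [0.8, -0.2, 1.1]
mu = lambda pos: 0.0      # placeholder location-aware mean normalizer
sigma = lambda pos: 1.0   # placeholder location-aware std normalizer
dense = densify_segment_rewards(segments, seg_rewards, mu, sigma)
print(dense)
```

The sketch only shows the general shape of the idea: a per-position normalizer replaces the single scalar normalizer of bandit-style RLHF, and the per-segment reward is turned into a denser per-token signal. The paper's actual segmentation, normalizer parameterization, and interpolation scheme may differ.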