Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
January 6, 2025
Authors: Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, Mingyuan Zhou
cs.AI
Abstract
Reinforcement learning from human feedback (RLHF) has been widely adopted to
align language models (LMs) with human preference. Prior RLHF works typically
take a bandit formulation, which, though intuitive, ignores the sequential
nature of LM generation and can suffer from the sparse reward issue. While
recent works propose dense token-level RLHF, treating each token as an action
may be too fine-grained for proper reward assignment. In this paper, we seek to get
the best of both by training and utilizing a segment-level reward model, which
assigns a reward to each semantically complete text segment that spans over a
short sequence of tokens. For reward learning, our method allows dynamic text
segmentation and compatibility with standard sequence-preference datasets. For
effective RL-based LM training against segment reward, we generalize the
classical scalar bandit reward normalizers into location-aware normalizer
functions and interpolate the segment reward for further densification. With
these designs, our method performs competitively on three popular RLHF
benchmarks for LM policy: AlpacaEval 2.0, Arena-Hard, and MT-Bench. Ablation
studies are conducted to further demonstrate our method.
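As a rough illustration of the abstract's key ingredients (segment-level rewards, a location-aware normalizer, and within-segment densification), the following is a minimal Python sketch. The segmentation, reward values, normalizer functions, and the equal-split densification scheme are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the authors' code) of turning segment-level rewards
# into normalized, dense per-token rewards for RL-based LM training.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    start: int    # index of the segment's first token in the response
    end: int      # index one past the segment's last token
    reward: float # scalar reward from a segment-level reward model (assumed given)


def location_aware_normalize(segments: List[Segment],
                             mean_fn: Callable[[int], float],
                             std_fn: Callable[[int], float],
                             eps: float = 1e-6) -> List[float]:
    """Normalize each segment reward with statistics that depend on the
    segment's location, generalizing the single scalar mean/std used for
    bandit-style sequence rewards."""
    return [(s.reward - mean_fn(s.start)) / (std_fn(s.start) + eps)
            for s in segments]


def densify(segments: List[Segment], normalized: List[float],
            seq_len: int) -> List[float]:
    """Spread each normalized segment reward over the tokens it covers
    (here an equal split; other interpolation schemes are possible),
    yielding a per-token reward vector a PPO-style trainer could consume."""
    token_rewards = [0.0] * seq_len
    for seg, r in zip(segments, normalized):
        width = max(seg.end - seg.start, 1)
        for t in range(seg.start, seg.end):
            token_rewards[t] = r / width
    return token_rewards


if __name__ == "__main__":
    # Toy example: a 10-token response split into three semantic segments.
    segs = [Segment(0, 3, 1.2), Segment(3, 7, -0.4), Segment(7, 10, 0.8)]
    # Placeholder normalizer functions; in practice these would be estimated
    # from observed rewards as functions of location, not fixed by hand.
    norm = location_aware_normalize(segs, mean_fn=lambda pos: 0.1,
                                    std_fn=lambda pos: 1.0)
    print(densify(segs, norm, seq_len=10))
```

The design choice sketched here is only that the normalizer is a function of position rather than a single scalar; how the paper segments text, learns the segment reward model, and interpolates rewards is described in the full paper rather than in this toy example.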