Sample-Efficient Alignment for LLMs
November 3, 2024
Authors: Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
We study methods for efficiently aligning large language models (LLMs) with
human preferences given budgeted online feedback. We first formulate the LLM
alignment problem in the frame of contextual dueling bandits. This formulation,
subsuming recent paradigms such as online RLHF and online DPO, inherently
calls for sample-efficient algorithms that incorporate online active
exploration. Leveraging insights from bandit theory, we introduce a unified
algorithm based on Thompson sampling and highlight its applications in two
distinct LLM alignment scenarios. The practical agent that efficiently
implements this algorithm, named SEA (Sample-Efficient Alignment), is
empirically validated through extensive experiments across three model scales
(1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The
results demonstrate that SEA achieves highly sample-efficient alignment with
the oracle's preferences, outperforming recent active exploration methods for LLMs.
Additionally, we release the implementation of SEA together with an efficient
codebase designed for online alignment of LLMs, aiming to accelerate future
research in this field.
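To make the contextual dueling bandit framing concrete, below is a minimal toy sketch, not the released SEA agent: a linear dueling bandit in which an ensemble of reward models stands in for an approximate posterior, two randomly drawn ensemble members each nominate a candidate response (a Thompson-sampling-style duel selection), and a simulated Bradley-Terry oracle provides binary preference feedback. All names, dimensions, and the ensemble-as-posterior choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "response" is a feature vector, the hidden reward is linear,
# and the preference oracle follows a Bradley-Terry model on true rewards.
DIM, N_CANDIDATES, N_ROUNDS, ENSEMBLE = 8, 16, 500, 4
theta_star = rng.normal(size=DIM)

def oracle_prefers_first(x_a, x_b):
    """Noisy preference feedback: P(a preferred over b) = sigmoid(r(a) - r(b))."""
    p = 1.0 / (1.0 + np.exp(-(x_a - x_b) @ theta_star))
    return rng.random() < p

# Ensemble of linear reward models as a crude posterior; drawing a random
# member approximates Thompson sampling over reward functions.
ensemble = [rng.normal(scale=0.1, size=DIM) for _ in range(ENSEMBLE)]
LR = 0.1

for t in range(N_ROUNDS):
    # Candidate "responses" for the current context (here: random features).
    candidates = rng.normal(size=(N_CANDIDATES, DIM))

    # Duel selection: two independently sampled models each pick the candidate
    # they currently believe is best.
    m1, m2 = rng.choice(ENSEMBLE, size=2, replace=False)
    a = int(np.argmax(candidates @ ensemble[m1]))
    b = int(np.argmax(candidates @ ensemble[m2]))
    if a == b:  # force a genuine duel if both members agree
        b = int(rng.integers(N_CANDIDATES))

    # Query the preference oracle, then update every ensemble member with one
    # SGD step on the Bradley-Terry (logistic) log-likelihood.
    y = 1.0 if oracle_prefers_first(candidates[a], candidates[b]) else 0.0
    diff = candidates[a] - candidates[b]
    for k in range(ENSEMBLE):
        p = 1.0 / (1.0 + np.exp(-diff @ ensemble[k]))
        ensemble[k] += LR * (y - p) * diff  # gradient ascent on log-likelihood

# Check how well the mean reward model aligns with the hidden reward direction.
est = np.mean(ensemble, axis=0)
print("cosine(theta_hat, theta_star) =",
      est @ theta_star / (np.linalg.norm(est) * np.linalg.norm(theta_star)))
```

In the LLM setting described above, the candidates would be sampled responses from the policy, the reward models would be learned on top of the LLM, and the oracle would be a human or reward-model annotator; the loop structure, however, illustrates why posterior-driven duel selection can use a fixed feedback budget more efficiently than passive sampling.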