Sample-Efficient Alignment for LLMs
November 3, 2024
Authors: Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
We study methods for efficiently aligning large language models (LLMs) with
human preferences given budgeted online feedback. We first formulate the LLM
alignment problem in the frame of contextual dueling bandits. This formulation,
subsuming recent paradigms such as online RLHF and online DPO, inherently
calls for sample-efficient algorithms that incorporate online active
exploration. Leveraging insights from bandit theory, we introduce a unified
algorithm based on Thompson sampling and highlight its applications in two
distinct LLM alignment scenarios. The practical agent that efficiently
implements this algorithm, named SEA (Sample-Efficient Alignment), is
empirically validated through extensive experiments across three model scales
(1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The
results demonstrate that SEA achieves highly sample-efficient alignment with
the oracle's preferences, outperforming recent active exploration methods for LLMs.
Additionally, we release the implementation of SEA together with an efficient
codebase designed for online alignment of LLMs, aiming to accelerate future
research in this field.
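To make the contextual dueling bandit framing concrete, below is a minimal toy sketch, not the released SEA agent: a linear dueling bandit in which an ensemble of reward models stands in for an approximate posterior, two randomly drawn ensemble members each nominate a candidate response (a Thompson-sampling-style duel selection), and a simulated Bradley-Terry oracle provides binary preference feedback. All names, dimensions, and the ensemble-as-posterior choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "response" is a feature vector, the hidden reward is linear,
# and the preference oracle follows a Bradley-Terry model on true rewards.
DIM, N_CANDIDATES, N_ROUNDS, ENSEMBLE = 8, 16, 500, 4
theta_star = rng.normal(size=DIM)

def oracle_prefers_first(x_a, x_b):
    """Noisy preference feedback: P(a preferred over b) = sigmoid(r(a) - r(b))."""
    p = 1.0 / (1.0 + np.exp(-(x_a - x_b) @ theta_star))
    return rng.random() < p

# Ensemble of linear reward models as a crude posterior; drawing a random
# member approximates Thompson sampling over reward functions.
ensemble = [rng.normal(scale=0.1, size=DIM) for _ in range(ENSEMBLE)]
LR = 0.1

for t in range(N_ROUNDS):
    # Candidate "responses" for the current context (here: random features).
    candidates = rng.normal(size=(N_CANDIDATES, DIM))

    # Duel selection: two independently sampled models each pick the candidate
    # they currently believe is best.
    m1, m2 = rng.choice(ENSEMBLE, size=2, replace=False)
    a = int(np.argmax(candidates @ ensemble[m1]))
    b = int(np.argmax(candidates @ ensemble[m2]))
    if a == b:  # force a genuine duel if both members agree
        b = int(rng.integers(N_CANDIDATES))

    # Query the preference oracle, then update every ensemble member with one
    # SGD step on the Bradley-Terry (logistic) log-likelihood.
    y = 1.0 if oracle_prefers_first(candidates[a], candidates[b]) else 0.0
    diff = candidates[a] - candidates[b]
    for k in range(ENSEMBLE):
        p = 1.0 / (1.0 + np.exp(-diff @ ensemble[k]))
        ensemble[k] += LR * (y - p) * diff  # gradient ascent on log-likelihood

# Check how well the mean reward model aligns with the hidden reward direction.
est = np.mean(ensemble, axis=0)
print("cosine(theta_hat, theta_star) =",
      est @ theta_star / (np.linalg.norm(est) * np.linalg.norm(theta_star)))
```

In the LLM setting described above, the candidates would be sampled responses from the policy, the reward models would be learned on top of the LLM, and the oracle would be a human or reward-model annotator; the loop structure, however, illustrates why posterior-driven duel selection can use a fixed feedback budget more efficiently than passive sampling.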