Sample-Efficient Alignment for LLMs
November 3, 2024
Authors: Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
We study methods for efficiently aligning large language models (LLMs) with
human preferences given budgeted online feedback. We first formulate the LLM
alignment problem in the framework of contextual dueling bandits. This formulation,
subsuming recent paradigms such as online RLHF and online DPO, inherently
calls for sample-efficient algorithms that incorporate online active
exploration. Leveraging insights from bandit theory, we introduce a unified
algorithm based on Thompson sampling and highlight its applications in two
distinct LLM alignment scenarios. The practical agent that efficiently
implements this algorithm, named SEA (Sample-Efficient Alignment), is
empirically validated through extensive experiments across three model scales
(1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The
results demonstrate that SEA achieves highly sample-efficient alignment with
the oracle's preferences, outperforming recent active exploration methods for LLMs.
Additionally, we release the implementation of SEA together with an efficient
codebase designed for online alignment of LLMs, aiming to accelerate future
research in this field.
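
To make the abstract's core idea concrete, below is a minimal sketch of Thompson sampling applied to a contextual dueling bandit over candidate LLM responses. It is not the authors' SEA implementation: the posterior over reward models is approximated here with a small ensemble of linear heads on fixed response features, the preference oracle is synthetic, and all names (EnsembleRewardModel, thompson_duel) are illustrative assumptions.

```python
# Sketch only: Thompson sampling for a contextual dueling bandit over candidate
# responses, with an ensemble of linear reward heads as a crude posterior.
import numpy as np


class EnsembleRewardModel:
    """Posterior approximation via an ensemble of linear reward heads (assumed design)."""

    def __init__(self, feat_dim: int, n_heads: int = 8, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_heads, feat_dim))  # one head per "posterior sample"
        self.lr = lr

    def scores(self, feats: np.ndarray) -> np.ndarray:
        # feats: (n_candidates, feat_dim) -> scores: (n_heads, n_candidates)
        return self.W @ feats.T

    def update(self, feat_win: np.ndarray, feat_lose: np.ndarray) -> None:
        # Bradley-Terry style gradient step on the (preferred, rejected) pair.
        diff = feat_win - feat_lose
        margin = self.W @ diff                               # (n_heads,)
        grad = -(1.0 - 1.0 / (1.0 + np.exp(-margin)))        # d(-log sigmoid(margin)) / d(margin)
        self.W -= self.lr * grad[:, None] * diff[None, :]


def thompson_duel(rm: EnsembleRewardModel, feats: np.ndarray, rng: np.random.Generator):
    """Pick a duel (i, j): the best arm under one posterior sample vs. another."""
    head_i, head_j = rng.choice(rm.W.shape[0], size=2, replace=False)
    scores = rm.scores(feats)
    return int(np.argmax(scores[head_i])), int(np.argmax(scores[head_j]))


# Toy loop: candidate feature vectors stand in for sampled LLM responses, and a
# hidden linear preference plays the role of the human/oracle annotator.
rng = np.random.default_rng(1)
feat_dim, n_candidates = 16, 32
true_w = rng.normal(size=feat_dim)
rm = EnsembleRewardModel(feat_dim)
for step in range(200):
    feats = rng.normal(size=(n_candidates, feat_dim))        # features of candidate responses
    i, j = thompson_duel(rm, feats, rng)
    p_i_wins = 1.0 / (1.0 + np.exp(-(feats[i] - feats[j]) @ true_w))
    win, lose = (i, j) if rng.random() < p_i_wins else (j, i)
    rm.update(feats[win], feats[lose])                       # learn only from pairwise feedback
```

The duel selection is the active-exploration step: because each arm of the pair is chosen under a different posterior sample, queries concentrate on responses whose relative quality the model is still uncertain about, which is what makes the feedback budget go further than uniform pair sampling.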