Sample-Efficient Alignment for LLMs
November 3, 2024
Authors: Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
We study methods for efficiently aligning large language models (LLMs) with
human preferences given budgeted online feedback. We first formulate the LLM
alignment problem in the framework of contextual dueling bandits. This formulation,
subsuming recent paradigms such as online RLHF and online DPO, inherently
calls for sample-efficient algorithms that incorporate online active
exploration. Leveraging insights from bandit theory, we introduce a unified
algorithm based on Thompson sampling and highlight its applications in two
distinct LLM alignment scenarios. The practical agent that efficiently
implements this algorithm, named SEA (Sample-Efficient Alignment), is
empirically validated through extensive experiments across three model scales
(1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The
results demonstrate that SEA achieves highly sample-efficient alignment with
the oracle's preferences, outperforming recent active exploration methods for LLMs.
Additionally, we release the implementation of SEA together with an efficient
codebase designed for online alignment of LLMs, aiming to accelerate future
research in this field.
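
To make the abstract's core idea concrete, below is a minimal sketch of Thompson sampling applied to a contextual dueling bandit over candidate LLM responses. It is not the authors' SEA implementation: the posterior over reward models is approximated here with a small ensemble of linear heads on fixed response features, the preference oracle is synthetic, and all names (EnsembleRewardModel, thompson_duel) are illustrative assumptions.

```python
# Sketch only: Thompson sampling for a contextual dueling bandit over candidate
# responses, with an ensemble of linear reward heads as a crude posterior.
import numpy as np


class EnsembleRewardModel:
    """Posterior approximation via an ensemble of linear reward heads (assumed design)."""

    def __init__(self, feat_dim: int, n_heads: int = 8, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(n_heads, feat_dim))  # one head per "posterior sample"
        self.lr = lr

    def scores(self, feats: np.ndarray) -> np.ndarray:
        # feats: (n_candidates, feat_dim) -> scores: (n_heads, n_candidates)
        return self.W @ feats.T

    def update(self, feat_win: np.ndarray, feat_lose: np.ndarray) -> None:
        # Bradley-Terry style gradient step on the (preferred, rejected) pair.
        diff = feat_win - feat_lose
        margin = self.W @ diff                               # (n_heads,)
        grad = -(1.0 - 1.0 / (1.0 + np.exp(-margin)))        # d(-log sigmoid(margin)) / d(margin)
        self.W -= self.lr * grad[:, None] * diff[None, :]


def thompson_duel(rm: EnsembleRewardModel, feats: np.ndarray, rng: np.random.Generator):
    """Pick a duel (i, j): the best arm under one posterior sample vs. another."""
    head_i, head_j = rng.choice(rm.W.shape[0], size=2, replace=False)
    scores = rm.scores(feats)
    return int(np.argmax(scores[head_i])), int(np.argmax(scores[head_j]))


# Toy loop: candidate feature vectors stand in for sampled LLM responses, and a
# hidden linear preference plays the role of the human/oracle annotator.
rng = np.random.default_rng(1)
feat_dim, n_candidates = 16, 32
true_w = rng.normal(size=feat_dim)
rm = EnsembleRewardModel(feat_dim)
for step in range(200):
    feats = rng.normal(size=(n_candidates, feat_dim))        # features of candidate responses
    i, j = thompson_duel(rm, feats, rng)
    p_i_wins = 1.0 / (1.0 + np.exp(-(feats[i] - feats[j]) @ true_w))
    win, lose = (i, j) if rng.random() < p_i_wins else (j, i)
    rm.update(feats[win], feats[lose])                       # learn only from pairwise feedback
```

The duel selection is the active-exploration step: because each arm of the pair is chosen under a different posterior sample, queries concentrate on responses whose relative quality the model is still uncertain about, which is what makes the feedback budget go further than uniform pair sampling.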