Direct Preference Optimization Using Sparse Feature-Level Constraints

Abstract

Aligning large language models (LLMs) with human preferences remains a key challenge. Post-training techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, but they often introduce computational inefficiency and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. The approach gains efficiency by operating on the sparse features activated in a well-trained sparse autoencoder, and it retains the quality of a sequential KL divergence constraint by using a feature-level offline reference. Experimental results on benchmark datasets show that FPO achieves a 5.08% absolute improvement in win rate over state-of-the-art baselines at much lower computational cost, making it a promising solution for efficient and controllable LLM alignment.
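The abstract does not give the exact objective, so the following is only a minimal sketch of how a feature-level constraint could be attached to a preference loss. It assumes a ReLU SAE encoder, the standard DPO loss for a single preference pair, and a mean-squared penalty between the policy's sparse features and reference features that were computed ahead of time (the "offline reference"). All function names, the top-k selection, the squared-error form of the penalty, and the weight `alpha` are illustrative assumptions, not the paper's definitions.

```python
import math


def sae_encode(h, W, b):
    """Assumed ReLU SAE encoder: f = relu(W h + b), on plain Python lists."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, h)) + bias)
        for row, bias in zip(W, b)
    ]


def top_k_indices(features, k):
    """Indices of the k largest activations (ties keep the original order)."""
    order = sorted(range(len(features)), key=lambda i: features[i], reverse=True)
    return order[:k]


def feature_penalty(f_policy, f_ref, k):
    """Mean squared gap over features that are top-k active in either model.

    f_ref is assumed to be precomputed and cached, so the reference model
    does not need to be run during training. The squared-error form is a
    stand-in for whatever feature-level constraint the paper uses.
    """
    idx = set(top_k_indices(f_policy, k)) | set(top_k_indices(f_ref, k))
    return sum((f_policy[i] - f_ref[i]) ** 2 for i in idx) / len(idx)


def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    x = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def constrained_loss(pair_logps, f_policy, f_ref, beta=0.1, alpha=0.5, k=2):
    """Preference loss plus a weighted feature-level penalty (hypothetical)."""
    return dpo_term(*pair_logps, beta) + alpha * feature_penalty(f_policy, f_ref, k)


# Toy usage: 2-dimensional hidden state, 3 SAE features.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]
f_policy = sae_encode([1.0, -1.0], W, b)
f_ref = sae_encode([1.0, 0.0], W, b)  # would be cached offline in practice
loss = constrained_loss((-1.0, -2.0, -1.5, -1.5), f_policy, f_ref)
```

Because the penalty only touches the few features that are active, its cost scales with `k` rather than with the vocabulary size, which is the intuition behind the efficiency claim; whether the paper realizes it in this form is not stated in the abstract.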
