Direct Preference Optimization Using Sparse Feature-Level Constraints

Abstract

Aligning large language models (LLMs) with human preferences remains a key challenge. Post-training techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, but they often introduce computational inefficiency and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach gains efficiency by using the sparse features activated in a well-trained sparse autoencoder, and it retains the quality of sequential KL divergence by using a feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.

