PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
October 17, 2024
Authors: Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang
cs.AI
Abstract
Alignment of large language models (LLMs) involves training models on
preference-contrastive output pairs to adjust their responses according to
human preferences. To obtain such contrastive pairs, traditional methods like
RLHF and RLAIF rely on limited contrasting patterns, such as varying model
variants or decoding temperatures. This singularity leads to two issues: (1)
alignment is not comprehensive; and thereby (2) models are susceptible to
jailbreaking attacks. To address these issues, we investigate how to construct
more comprehensive and diversified contrasting patterns to enhance preference
data (RQ1) and verify the impact of the diversification of contrasting patterns
on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that
integrates diversified contrasting patterns across the prompt, model, and
pipeline levels, introducing six contrasting strategies that do not require
additional feedback labeling procedures. Regarding RQ2, we conduct thorough
experiments demonstrating that PopAlign significantly outperforms existing
methods, leading to more comprehensive alignment.
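The abstract notes that traditional pipelines such as RLHF/RLAIF obtain contrastive pairs by varying decoding temperatures. A minimal sketch of that pattern, using a toy stand-in sampler (the names `toy_sample`, `VOCAB`, and `temperature_contrast_pair` are illustrative assumptions, not from the paper):

```python
import math
import random

# Toy stand-in for an LLM's next-token distribution: fixed logits per token.
# Higher temperature flattens the distribution, making low-quality
# completions more likely -- the basis of the temperature contrast.
VOCAB = {"helpful": 5.0, "neutral": 2.0, "harmful": 0.5}  # token -> logit

def toy_sample(temperature: float, rng: random.Random) -> str:
    """Sample one token from the softmax of VOCAB at a given temperature."""
    tokens = list(VOCAB)
    weights = [math.exp(VOCAB[t] / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def temperature_contrast_pair(prompt: str, rng: random.Random) -> dict:
    """Build one preference-contrastive pair for a prompt: the low-temperature
    sample is treated as 'chosen', the high-temperature one as 'rejected'."""
    chosen = toy_sample(temperature=0.3, rng=rng)
    rejected = toy_sample(temperature=2.0, rng=rng)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

rng = random.Random(0)
pair = temperature_contrast_pair("How do I stay safe online?", rng)
print(pair)
```

This single contrasting pattern is exactly the "singularity" the paper critiques: PopAlign instead diversifies contrasts across prompt, model, and pipeline levels.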