PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
October 17, 2024
Authors: Zekun Moore Wang, Shawn Wang, Kang Zhu, Jiaheng Liu, Ke Xu, Jie Fu, Wangchunshu Zhou, Wenhao Huang
cs.AI
Abstract
Alignment of large language models (LLMs) involves training models on
preference-contrastive output pairs to adjust their responses according to
human preferences. To obtain such contrastive pairs, traditional methods like
RLHF and RLAIF rely on limited contrasting patterns, such as varying model
variants or decoding temperatures. This singularity leads to two issues: (1)
alignment is not comprehensive; and thereby (2) models are susceptible to
jailbreaking attacks. To address these issues, we investigate how to construct
more comprehensive and diversified contrasting patterns to enhance preference
data (RQ1) and verify the impact of the diversification of contrasting patterns
on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that
integrates diversified contrasting patterns across the prompt, model, and
pipeline levels, introducing six contrasting strategies that do not require
additional feedback labeling procedures. Regarding RQ2, we conduct thorough
experiments demonstrating that PopAlign significantly outperforms existing
methods, leading to more comprehensive alignment.
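The abstract notes that traditional pipelines such as RLHF/RLAIF obtain contrastive pairs by varying decoding temperatures. A minimal sketch of that pattern, using a toy stand-in sampler (the names `toy_sample`, `VOCAB`, and `temperature_contrast_pair` are illustrative assumptions, not from the paper):

```python
import math
import random

# Toy stand-in for an LLM's next-token distribution: fixed logits per token.
# Higher temperature flattens the distribution, making low-quality
# completions more likely -- the basis of the temperature contrast.
VOCAB = {"helpful": 5.0, "neutral": 2.0, "harmful": 0.5}  # token -> logit

def toy_sample(temperature: float, rng: random.Random) -> str:
    """Sample one token from the softmax of VOCAB at a given temperature."""
    tokens = list(VOCAB)
    weights = [math.exp(VOCAB[t] / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def temperature_contrast_pair(prompt: str, rng: random.Random) -> dict:
    """Build one preference-contrastive pair for a prompt: the low-temperature
    sample is treated as 'chosen', the high-temperature one as 'rejected'."""
    chosen = toy_sample(temperature=0.3, rng=rng)
    rejected = toy_sample(temperature=2.0, rng=rng)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

rng = random.Random(0)
pair = temperature_contrast_pair("How do I stay safe online?", rng)
print(pair)
```

This single contrasting pattern is exactly the "singularity" the paper critiques: PopAlign instead diversifies contrasts across prompt, model, and pipeline levels.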