SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

October 7, 2024
Authors: Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo
cs.AI

Abstract

Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace "losing images" in preference pairs. This approach allows us to optimize using only off-policy "winning images." Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released at https://github.com/DwanZhang-AI/SePPO.
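To make the abstract's description concrete, below is a minimal sketch of what a SePPO-style training step could look like. It assumes latent-diffusion noise-prediction models and a Diffusion-DPO-style objective built from per-sample denoising errors; the specific anchor criterion, the `denoise_error`/`seppo_step` helpers, and the `beta` temperature are illustrative assumptions rather than the authors' released code (see the linked repository for that).

```python
# Illustrative SePPO-style step (not the authors' implementation): the "losing"
# image of a DPO pair is replaced by an on-policy sample from a previous
# checkpoint, and an anchor-based check decides whether that sample is treated
# as a win or a loss. All names and hyperparameters here are assumptions.
import torch
import torch.nn.functional as F


def denoise_error(model, latents, noise, timesteps, cond):
    """Per-sample noise-prediction MSE on [B, C, H, W] latents, used as a
    proxy for negative log-likelihood (as in Diffusion-DPO-style objectives)."""
    pred = model(latents, timesteps, cond)  # model: noise-prediction network
    return F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))


def seppo_step(policy, ref_ckpt, win_latents, ref_latents, noise, t, cond, beta=1.0):
    # win_latents: noised off-policy "winning" images from the preference data.
    # ref_latents: noised on-policy samples generated by a previous checkpoint
    #              (ref_ckpt), standing in for the "losing" images.
    err_w_pol = denoise_error(policy, win_latents, noise, t, cond)
    err_r_pol = denoise_error(policy, ref_latents, noise, t, cond)
    with torch.no_grad():  # the previous checkpoint stays frozen
        err_w_ref = denoise_error(ref_ckpt, win_latents, noise, t, cond)
        err_r_ref = denoise_error(ref_ckpt, ref_latents, noise, t, cond)

    # Hypothetical anchor-based criterion: use the winning image as the anchor.
    # If the policy fits the reference sample at least as well as the anchor,
    # treat it as a likely "win" (s = +1); otherwise as a likely "loss" (s = -1).
    s = (err_r_pol <= err_w_pol).float() * 2.0 - 1.0

    win_adv = err_w_ref - err_w_pol  # policy gain on the winning image
    ref_adv = err_r_ref - err_r_pol  # policy gain on the reference sample
    logits = beta * (win_adv + s * ref_adv)  # s flips the reference term's role
    return -F.logsigmoid(logits).mean()
```

The detail that matters in this sketch is the data flow, not the exact criterion: only off-policy winning images and checkpoint-generated reference samples are needed, with no reward model and no paired human annotations.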
