Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Abstract

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance, and we analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA significantly enhances the guidance-free performance of all tested models with just one epoch of fine-tuning on the pretraining dataset (about 1% of pretraining epochs), reaching performance on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
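To make the sampling-cost claim concrete, the following is a minimal sketch of how CFG is commonly applied to next-token logits in AR sampling, assuming the widely used form `uncond + scale * (cond - uncond)`. It is not taken from the CCA codebase. The `CountingToyModel` class, the function names, and the three-token vocabulary are hypothetical and exist only for illustration. The point is that guided sampling needs two forward passes per token, while guidance-free sampling needs one.

```python
import math
from typing import Callable, List, Optional

Logits = List[float]
# Hypothetical model interface: (token prefix, condition or None) -> next-token logits.
Model = Callable[[List[int], Optional[int]], Logits]


def softmax(logits: Logits) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_logits(model: Model, prefix: List[int], cond: int, scale: float) -> Logits:
    """Guided logits: requires a conditional AND an unconditional forward pass."""
    cond_logits = model(prefix, cond)    # forward pass 1
    uncond_logits = model(prefix, None)  # forward pass 2
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def guidance_free_logits(model: Model, prefix: List[int], cond: int) -> Logits:
    """Guidance-free logits: a single conditional forward pass."""
    return model(prefix, cond)


class CountingToyModel:
    """Toy stand-in for an AR model (assumption) that counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prefix: List[int], cond: Optional[int]) -> Logits:
        self.calls += 1
        base = [0.0, 1.0, 2.0]
        if cond is None:
            return base
        # The condition raises the logit of the token with the same index.
        return [b + (1.0 if i == cond else 0.0) for i, b in enumerate(base)]


if __name__ == "__main__":
    guided_model = CountingToyModel()
    guided = cfg_logits(guided_model, prefix=[], cond=0, scale=3.0)
    print("guided probs:", softmax(guided), "passes:", guided_model.calls)

    plain_model = CountingToyModel()
    plain = guidance_free_logits(plain_model, prefix=[], cond=0)
    print("guidance-free probs:", softmax(plain), "passes:", plain_model.calls)
```

In this sketch, a scale of 1.0 reduces the guided logits to the plain conditional logits, and larger scales push probability mass toward tokens favored by the condition. A guidance-free model would need to produce comparably sharpened conditional distributions directly from its single forward pass, which is the role the abstract assigns to fine-tuning with CCA.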
