Improving Long-Text Alignment for Text-to-Image Diffusion Models
October 15, 2024
Authors: Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu
cs.AI
Abstract
The rapid advancement of text-to-image (T2I) diffusion models has enabled
them to generate unprecedented results from given texts. However, as text
inputs become longer, existing encoding methods like CLIP face limitations, and
aligning the generated images with long texts becomes challenging. To tackle
these issues, we propose LongAlign, which includes a segment-level encoding
method for processing long texts and a decomposed preference optimization
method for effective alignment training. For segment-level encoding, long texts
are divided into multiple segments and processed separately. This method
overcomes the maximum input length limits of pretrained encoding models. For
preference optimization, we provide decomposed CLIP-based preference models to
fine-tune diffusion models. Specifically, to utilize CLIP-based preference
models for T2I alignment, we delve into their scoring mechanisms and find that
the preference scores can be decomposed into two components: a text-relevant
part that measures T2I alignment and a text-irrelevant part that assesses other
visual aspects of human preference. Additionally, we find that the
text-irrelevant part contributes to a common overfitting problem during
fine-tuning. To address this, we propose a reweighting strategy that assigns
different weights to these two components, thereby reducing overfitting and
enhancing alignment. After fine-tuning 512×512 Stable Diffusion (SD)
v1.5 for about 20 hours using our method, the fine-tuned SD outperforms
stronger foundation models in T2I alignment, such as PixArt-alpha and
Kandinsky v2.2. The code is available at
https://github.com/luping-liu/LongAlign.
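
To illustrate the segment-level encoding idea, here is a minimal sketch assuming sentence-boundary splitting and a HuggingFace CLIP text encoder; LongAlign's actual segmentation and merging rules may differ (see the repository above). Each segment is encoded within CLIP's 77-token window, and the per-segment embeddings are concatenated along the sequence axis so the diffusion model's cross-attention can condition on the full long prompt:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD v1.5's text encoder; CLIP accepts at most 77 tokens per forward pass.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def encode_long_prompt(text: str, max_len: int = 77) -> torch.Tensor:
    # Naive segmentation at sentence boundaries (an assumption of this sketch).
    segments = [s.strip() for s in text.split(".") if s.strip()]
    embeds = []
    with torch.no_grad():
        for seg in segments:
            tokens = tokenizer(seg, truncation=True, max_length=max_len,
                               padding="max_length", return_tensors="pt")
            # (1, 77, 768) hidden states for this segment.
            embeds.append(text_encoder(input_ids=tokens.input_ids).last_hidden_state)
    # Concatenate along the sequence axis: (1, 77 * num_segments, 768).
    # The UNet's cross-attention can attend over this longer context.
    return torch.cat(embeds, dim=1)
```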
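The reweighting strategy can be sketched as follows. Assume, purely for illustration, that the text-irrelevant direction is a prompt-independent unit vector u (e.g., the mean of normalized CLIP text embeddings over a prompt set); the paper's actual decomposition may differ. The CLIP score img @ txt then splits exactly into a component along u and an orthogonal, prompt-specific remainder, which are recombined with separate weights during fine-tuning (reweighted_preference, w_rel, and w_irr below are hypothetical names and placeholder values):

```python
import torch

def reweighted_preference(img: torch.Tensor, txt: torch.Tensor, u: torch.Tensor,
                          w_rel: float = 1.0, w_irr: float = 0.3) -> torch.Tensor:
    """img, txt: L2-normalized CLIP image/text embeddings, shape (d,).
    u: unit-norm, prompt-independent direction (assumed here to capture the
    text-irrelevant part of the score). w_rel/w_irr are placeholder weights."""
    coeff = txt @ u                # projection of the text embedding onto u
    residual = txt - coeff * u     # prompt-specific, orthogonal remainder
    s_irr = coeff * (img @ u)      # text-irrelevant part of the score
    s_rel = img @ residual         # text-relevant part (measures T2I alignment)
    # img @ txt == s_irr + s_rel; down-weighting s_irr curbs the overfitting
    # the abstract attributes to the text-irrelevant component.
    return w_rel * s_rel + w_irr * s_irr
```

During preference fine-tuning, this reweighted score would stand in for the raw CLIP preference score as the reward signal.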