Improving Long-Text Alignment for Text-to-Image Diffusion Models
October 15, 2024
Authors: Luping Liu, Chao Du, Tianyu Pang, Zehan Wang, Chongxuan Li, Dong Xu
cs.AI
Abstract
The rapid advancement of text-to-image (T2I) diffusion models has enabled
them to generate unprecedented results from given texts. However, as text
inputs become longer, existing encoding methods like CLIP face limitations, and
aligning the generated images with long texts becomes challenging. To tackle
these issues, we propose LongAlign, which includes a segment-level encoding
method for processing long texts and a decomposed preference optimization
method for effective alignment training. For segment-level encoding, long texts
are divided into multiple segments and processed separately. This method
overcomes the maximum input length limits of pretrained encoding models. For
preference optimization, we provide decomposed CLIP-based preference models to
fine-tune diffusion models. Specifically, to utilize CLIP-based preference
models for T2I alignment, we delve into their scoring mechanisms and find that
the preference scores can be decomposed into two components: a text-relevant
part that measures T2I alignment and a text-irrelevant part that assesses other
visual aspects of human preference. Additionally, we find that the
text-irrelevant part contributes to a common overfitting problem during
fine-tuning. To address this, we propose a reweighting strategy that assigns
different weights to these two components, thereby reducing overfitting and
enhancing alignment. After fine-tuning 512×512 Stable Diffusion (SD)
v1.5 for about 20 hours using our method, the fine-tuned SD outperforms
stronger foundation models in T2I alignment, such as PixArt-alpha and
Kandinsky v2.2. The code is available at
https://github.com/luping-liu/LongAlign.
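
To illustrate the segment-level encoding idea, here is a minimal sketch assuming sentence-boundary splitting and a HuggingFace CLIP text encoder; LongAlign's actual segmentation and merging rules may differ (see the repository above). Each segment is encoded within CLIP's 77-token window, and the per-segment embeddings are concatenated along the sequence axis so the diffusion model's cross-attention can condition on the full long prompt:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD v1.5's text encoder; CLIP accepts at most 77 tokens per forward pass.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def encode_long_prompt(text: str, max_len: int = 77) -> torch.Tensor:
    # Naive segmentation at sentence boundaries (an assumption of this sketch).
    segments = [s.strip() for s in text.split(".") if s.strip()]
    embeds = []
    with torch.no_grad():
        for seg in segments:
            tokens = tokenizer(seg, truncation=True, max_length=max_len,
                               padding="max_length", return_tensors="pt")
            # (1, 77, 768) hidden states for this segment.
            embeds.append(text_encoder(input_ids=tokens.input_ids).last_hidden_state)
    # Concatenate along the sequence axis: (1, 77 * num_segments, 768).
    # The UNet's cross-attention can attend over this longer context.
    return torch.cat(embeds, dim=1)
```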
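The reweighting strategy can be sketched as follows. Assume, purely for illustration, that the text-irrelevant direction is a prompt-independent unit vector u (e.g., the mean of normalized CLIP text embeddings over a prompt set); the paper's actual decomposition may differ. The CLIP score img @ txt then splits exactly into a component along u and an orthogonal, prompt-specific remainder, which are recombined with separate weights during fine-tuning (reweighted_preference, w_rel, and w_irr below are hypothetical names and placeholder values):

```python
import torch

def reweighted_preference(img: torch.Tensor, txt: torch.Tensor, u: torch.Tensor,
                          w_rel: float = 1.0, w_irr: float = 0.3) -> torch.Tensor:
    """img, txt: L2-normalized CLIP image/text embeddings, shape (d,).
    u: unit-norm, prompt-independent direction (assumed here to capture the
    text-irrelevant part of the score). w_rel/w_irr are placeholder weights."""
    coeff = txt @ u                # projection of the text embedding onto u
    residual = txt - coeff * u     # prompt-specific, orthogonal remainder
    s_irr = coeff * (img @ u)      # text-irrelevant part of the score
    s_rel = img @ residual         # text-relevant part (measures T2I alignment)
    # img @ txt == s_irr + s_rel; down-weighting s_irr curbs the overfitting
    # the abstract attributes to the text-irrelevant component.
    return w_rel * s_rel + w_irr * s_irr
```

During preference fine-tuning, this reweighted score would stand in for the raw CLIP preference score as the reward signal.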