
Pixel-Space Post-Training of Latent Diffusion Models

September 26, 2024
作者: Christina Zhang, Simran Motwani, Matthew Yu, Ji Hou, Felix Juefei-Xu, Sam Tsai, Peter Vajda, Zijian He, Jialiang Wang
cs.AI

Abstract

Latent diffusion models (LDMs) have made significant advances in image generation in recent years. One major advantage of LDMs is their ability to operate in a compressed latent space, allowing for more efficient training and deployment. Despite these advantages, challenges remain: for example, it has been observed that LDMs often generate high-frequency details and complex compositions imperfectly. We hypothesize that one cause of these flaws is that all pre- and post-training of LDMs is done in latent space, which typically has 8x8 lower spatial resolution than the output images. To address this issue, we propose adding pixel-space supervision during post-training to better preserve high-frequency details. Experimentally, we show that adding a pixel-space objective significantly improves both supervised quality fine-tuning and preference-based post-training on state-of-the-art DiT transformer and U-Net diffusion models, in both visual quality and visual flaw metrics, while maintaining the same text alignment quality.
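The core idea, adding a pixel-space term to an otherwise latent-space objective, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `decode` stands in for a VAE decoder mapping latents to images, `toy_decode` is a hypothetical nearest-neighbor upsampler used only to make the sketch runnable, and the simple MSE terms and `pixel_weight` are assumptions.

```python
import numpy as np

def combined_loss(pred_latent, target_latent, decode, pixel_weight=0.5):
    """Sketch of a latent-space objective supplemented by pixel-space
    supervision: both predicted and target latents are decoded to images
    and compared there as well. `pixel_weight` is a hypothetical knob."""
    latent_loss = np.mean((pred_latent - target_latent) ** 2)
    pixel_loss = np.mean((decode(pred_latent) - decode(target_latent)) ** 2)
    return latent_loss + pixel_weight * pixel_loss

def toy_decode(z):
    """Toy stand-in for a VAE decoder: each latent cell is expanded to an
    8x8 pixel patch (nearest-neighbor), mirroring the 8x8 resolution gap
    between latent space and output images mentioned in the abstract."""
    return np.repeat(np.repeat(z, 8, axis=-2), 8, axis=-1)
```

In an actual post-training loop the pixel-space term would be computed on decoder outputs and backpropagated through the (frozen or trainable) decoder alongside the usual fine-tuning or preference objective.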

