From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
April 22, 2025
Authors: Le Zhuo, Liangbing Zhao, Sayak Paul, Yue Liao, Renrui Zhang, Yi Xin, Peng Gao, Mohamed Elhoseiny, Hongsheng Li
cs.AI
Abstract
Recent text-to-image diffusion models achieve impressive visual quality
through extensive scaling of training data and model parameters, yet they often
struggle with complex scenes and fine-grained details. Inspired by the
self-reflection capabilities emergent in large language models, we propose
ReflectionFlow, an inference-time framework enabling diffusion models to
iteratively reflect upon and refine their outputs. ReflectionFlow introduces
three complementary inference-time scaling axes: (1) noise-level scaling to
optimize latent initialization; (2) prompt-level scaling for precise semantic
guidance; and most notably, (3) reflection-level scaling, which explicitly
provides actionable reflections to iteratively assess and correct previous
generations. To facilitate reflection-level scaling, we construct GenRef, a
large-scale dataset comprising 1 million triplets, each containing a
reflection, a flawed image, and an enhanced image. Leveraging this dataset, we
efficiently perform reflection tuning on the state-of-the-art diffusion
transformer, FLUX.1-dev, by jointly modeling multimodal inputs within a unified
framework. Experimental results show that ReflectionFlow significantly
outperforms naive noise-level scaling methods, offering a scalable and
compute-efficient solution toward higher-quality image synthesis on challenging
tasks.
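The three scaling axes described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the verifier, the reflection generator, or the interface of the tuned model, so the callables `generate`, `score`, `reflect`, and `refine_prompt`, the `Candidate` class, and the parameters `num_seeds` and `num_rounds` are all hypothetical placeholders. The sketch also assumes a simple greedy rule (keep a refined image only if it scores higher), which is one possible choice among many.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Candidate:
    """One generated image with the prompt that produced it and its score."""
    image: Any
    prompt: str
    score: float


def reflection_flow_sketch(
    prompt: str,
    generate: Callable[[str, int, Optional[Any], Optional[str]], Any],
    score: Callable[[str, Any], float],
    reflect: Callable[[str, Any], str],
    refine_prompt: Callable[[str, str], str],
    num_seeds: int = 4,
    num_rounds: int = 3,
    rng: Optional[random.Random] = None,
) -> Candidate:
    """Hypothetical inference-time loop combining the three scaling axes.

    generate(prompt, seed, previous_image, reflection) -> image
    score(original_prompt, image) -> float, higher is better
    reflect(original_prompt, image) -> textual reflection on the image
    refine_prompt(current_prompt, reflection) -> rewritten prompt
    All four callables are assumptions made for illustration.
    """
    if num_seeds < 1:
        raise ValueError("num_seeds must be at least 1")
    rng = rng or random.Random()

    def best_of_n(gen_prompt, parent_image, reflection):
        # Noise-level scaling: sample several seeds, keep the top-scoring image.
        best = None
        for _ in range(num_seeds):
            seed = rng.randrange(2**31)
            image = generate(gen_prompt, seed, parent_image, reflection)
            # Always score against the original prompt (the user's intent).
            cand = Candidate(image, gen_prompt, score(prompt, image))
            if best is None or cand.score > best.score:
                best = cand
        return best

    # Initial pass: plain best-of-N with no reflection available yet.
    best = best_of_n(prompt, None, None)

    for _ in range(num_rounds):
        # Reflection-level scaling: describe what is wrong with the current best.
        reflection = reflect(prompt, best.image)
        # Prompt-level scaling: rewrite the prompt in light of the reflection.
        new_prompt = refine_prompt(best.prompt, reflection)
        # Regenerate conditioned on the flawed image and the reflection.
        challenger = best_of_n(new_prompt, best.image, reflection)
        # Greedy acceptance (an assumption): keep only strict improvements.
        if challenger.score > best.score:
            best = challenger
    return best
```

In practice `generate` would wrap a diffusion pipeline and `score` and `reflect` would wrap a verifier or multimodal model; here they are left abstract so the control flow of the loop is the only thing the sketch commits to.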