
GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis

December 8, 2024
Authors: Ashish Goswami, Satyam Kumar Modi, Santhosh Rishi Deshineni, Harman Singh, Prathosh A. P, Parag Singla
cs.AI

Abstract

Text-to-image (T2I) generation has seen significant progress with diffusion models, enabling the generation of photo-realistic images from text prompts. Despite this progress, existing methods still face challenges in following complex text prompts, especially those requiring compositional and multi-step reasoning. Given such complex instructions, SOTA models often make mistakes in faithfully modeling object attributes and the relationships among them. In this work, we present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps: (a) Generate: we first generate an image using existing diffusion models; (b) Plan: we make use of Multi-Modal LLMs (MLLMs) to identify the mistakes in the generated image, expressed in terms of individual objects and their properties, and produce a sequence of corrective steps in the form of an edit plan; (c) Edit: we make use of existing text-guided image-editing models to sequentially execute the edit plan over the generated image and obtain the desired image, faithful to the original instruction. Our approach derives its strength from the fact that it is modular in nature, is training-free, and can be applied over any combination of image-generation and editing models. As an added contribution, we also develop a model capable of compositional editing, which further helps improve the overall accuracy of our proposed approach. Our method flexibly trades inference-time compute for performance on compositional text prompts. We perform extensive experimental evaluation across 3 benchmarks and 10 T2I models, including DALLE-3 and the latest SD-3.5-Large. Our approach not only improves the performance of SOTA models by up to 3 points, but also reduces the performance gap between weaker and stronger models. https://dair-iitd.github.io/GraPE/
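Concretely, the generate-plan-edit pipeline described in the abstract reduces to a short orchestration loop. The Python sketch below is a minimal illustration under stated assumptions, not code from the paper: `generate`, `plan`, and `edit` are hypothetical callables standing in for whichever diffusion model, MLLM planner, and text-guided editor are plugged in, and the `max_rounds` knob is an added assumption meant to reflect the inference-time-compute tradeoff the abstract mentions.

```python
from typing import Any, Callable, List

Image = Any  # placeholder type; a real pipeline would use e.g. a PIL.Image or tensor


def grape(
    prompt: str,
    generate: Callable[[str], Image],         # any existing T2I diffusion model
    plan: Callable[[str, Image], List[str]],  # MLLM: returns corrective edit steps
    edit: Callable[[Image, str], Image],      # any text-guided image editor
    max_rounds: int = 1,                      # assumed knob: extra plan/edit passes
) -> Image:
    # (a) Generate: draft an image from the prompt with the chosen T2I model.
    image = generate(prompt)
    for _ in range(max_rounds):
        # (b) Plan: the MLLM compares the image against the prompt and returns
        # an ordered list of edit instructions; an empty plan means the image
        # already satisfies the prompt.
        steps = plan(prompt, image)
        if not steps:
            break
        # (c) Edit: execute the edit plan sequentially with the editing model.
        for step in steps:
            image = edit(image, step)
    return image
```

Because the three stages are passed in as plain callables, any combination of generation and editing models can be swapped in without retraining, which is the modularity and training-free property the abstract claims.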
