GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

March 13, 2025
Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
cs.AI

Abstract

Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/rongyaofang/GoT.
