

I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

February 12, 2025
Authors: Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
cs.AI

Abstract

This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.
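
The abstract's key observation is that a diffusion decoder conditioned on an LLM encoder's prompt embeddings shares its input feature space with that LLM's own decoder, so the VLM only needs to be aligned with the LLM decoder rather than with the diffusion model itself. Below is a minimal, hypothetical sketch of that proxy alignment step in PyTorch; the module names, dimensions, and the frozen-decoder stand-in are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of ThinkDiff-style proxy alignment (names and shapes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aligner(nn.Module):
    """Projects VLM token features into the LLM-encoder feature space
    that the diffusion decoder already consumes as prompt embeddings."""
    def __init__(self, vlm_dim=1024, llm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, vlm_tokens):            # (B, T, vlm_dim)
        return self.net(vlm_tokens)           # (B, T, llm_dim)

# Placeholder tensors/modules stand in for the frozen VLM and frozen LLM decoder;
# in the proxy task only the aligner receives gradient updates.
B, T, VLM_DIM, LLM_DIM, VOCAB = 2, 16, 1024, 4096, 32000
vlm_tokens = torch.randn(B, T, VLM_DIM)                          # frozen VLM output (placeholder)
decoder_head = nn.Linear(LLM_DIM, VOCAB).requires_grad_(False)   # frozen LLM decoder (placeholder)
target_ids = torch.randint(0, VOCAB, (B, T))                     # caption tokens for the proxy LM loss

aligner = Aligner(VLM_DIM, LLM_DIM)
aligned = aligner(vlm_tokens)                          # features now live in the shared LLM-encoder space
logits = decoder_head(aligned)                         # frozen decoder reads the aligned features
loss = F.cross_entropy(logits.reshape(-1, VOCAB), target_ids.reshape(-1))
loss.backward()                                        # gradients flow only into the aligner
print(loss.item())
```

At inference time, under this sketch, the trained aligner's output would be passed to a diffusion decoder that uses the same LLM encoder for prompt embedding, which is why no diffusion training data is needed for the alignment itself.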
