Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
February 27, 2025
Authors: Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
cs.AI
Abstract
The field of advanced text-to-image generation is witnessing the emergence of
unified frameworks that integrate powerful text encoders, such as CLIP and T5,
with Diffusion Transformer backbones. Although there have been efforts to
control output images with additional conditions, such as Canny edge maps and
depth maps, a comprehensive framework for arbitrary text-image interleaved control is still
lacking. This gap is especially evident when attempting to merge concepts or
visual elements from multiple images in the generation process. To mitigate the
gap, we conducted preliminary experiments showing that large multimodal models
(LMMs) offer an effective shared representation space, where image and text can
be well-aligned to serve as a condition for external diffusion models. Based on
this discovery, we propose Dream Engine, an efficient and unified framework
designed for arbitrary text-image interleaved control in image generation
models. Building on powerful text-to-image models like SD3.5, we replace the
original text-only encoders by incorporating versatile multimodal information
encoders such as QwenVL. Our approach utilizes a two-stage training paradigm,
consisting of joint text-image alignment and multimodal interleaved instruction
tuning. Our experiments demonstrate that this training method is effective,
achieving a 0.69 overall score on the GenEval benchmark, and matching the
performance of state-of-the-art text-to-image models like SD3.5 and FLUX.
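The core idea the abstract describes — replacing a text-only encoder with a multimodal encoder whose shared text-image representations condition a diffusion transformer — can be illustrated with a toy PyTorch sketch. This is not the paper's code: every module, name, and dimension below is a stand-in assumption (a small Transformer in place of QwenVL, a single cross-attention block in place of SD3.5's backbone), shown only to make the conditioning swap concrete.

```python
import torch
import torch.nn as nn

class ToyMultimodalEncoder(nn.Module):
    """Stand-in for an LMM (e.g. QwenVL): jointly encodes text tokens
    and image patches into one shared representation space."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.image_proj = nn.Linear(16, dim)  # 16 = toy patch feature size
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, text_ids, image_patches):
        # Interleave by concatenation: [text tokens | reference-image patches]
        seq = torch.cat(
            [self.text_embed(text_ids), self.image_proj(image_patches)], dim=1
        )
        return self.encoder(seq)  # shared text-image condition sequence

class ToyDiffusionTransformer(nn.Module):
    """Stand-in denoiser: noisy latents attend to the multimodal
    condition via cross-attention, replacing text-only conditioning."""
    def __init__(self, dim=64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_latents, condition):
        attended, _ = self.cross_attn(noisy_latents, condition, condition)
        return self.out(attended)  # predicted noise for the latents

encoder = ToyMultimodalEncoder()
denoiser = ToyDiffusionTransformer()
text_ids = torch.randint(0, 1000, (1, 8))   # 8 prompt tokens
image_patches = torch.randn(1, 4, 16)       # 4 patches from a reference image
latents = torch.randn(1, 10, 64)            # 10 noisy latent tokens
cond = encoder(text_ids, image_patches)     # shape (1, 12, 64)
pred = denoiser(latents, cond)              # shape (1, 10, 64)
```

In this reading, the paper's two-stage recipe would first align the encoder's outputs with what the frozen denoiser expects (joint text-image alignment), then fine-tune on interleaved instructions so multiple reference images and text can condition one generation.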