MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing
February 28, 2025
Authors: Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
cs.AI
Abstract
Despite significant progress in diffusion-based image generation,
subject-driven generation and instruction-based editing remain challenging.
Existing methods typically treat them separately, struggling with limited
high-quality data and poor generalization. However, both tasks require
capturing complex visual variations while maintaining consistency between
inputs and outputs. Therefore, we propose MIGE, a unified framework that
standardizes task representations using multimodal instructions. It treats
subject-driven generation as creation on a blank canvas and instruction-based
editing as modification of an existing image, establishing a shared
input-output formulation. MIGE introduces a novel multimodal encoder that maps
free-form multimodal instructions into a unified vision-language space,
integrating visual and semantic features through a feature fusion
mechanism. This unification enables joint training of both tasks, providing two
key advantages: (1) Cross-Task Enhancement: By leveraging shared visual and
semantic representations, joint training improves instruction adherence and
visual consistency in both subject-driven generation and instruction-based
editing. (2) Generalization: Learning in a unified format facilitates
cross-task knowledge transfer, enabling MIGE to generalize to novel
compositional tasks, including instruction-based subject-driven editing.
Experiments show that MIGE excels in both subject-driven generation and
instruction-based editing, while setting a new state of the art in the novel task of
instruction-based subject-driven editing. Code and model are publicly
available at https://github.com/Eureka-Maggie/MIGE.
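To make the shared input-output formulation concrete, below is a minimal PyTorch-style sketch (our own illustration, not the released MIGE code): a toy multimodal encoder that fuses semantic and visual features of a multimodal instruction, and a condition builder that treats subject-driven generation as conditioning on a blank canvas and instruction-based editing as conditioning on the source image. All module names, dimensions, and the concrete fusion design are assumptions; the paper's actual encoder and diffusion backbone are not reproduced here.

```python
# Illustrative sketch of MIGE's unified formulation (hypothetical, not the paper's code).
# Assumption: semantic features come from a text/vision encoder, visual features from an
# image encoder; "fusion" is modeled as a simple learned projection over concatenation.
import torch
import torch.nn as nn


class MultimodalEncoder(nn.Module):
    """Fuses visual and semantic features of a multimodal instruction
    into one unified vision-language token sequence."""

    def __init__(self, sem_dim=768, vis_dim=1024, out_dim=1024):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, out_dim)
        self.vis_proj = nn.Linear(vis_dim, out_dim)
        # Hypothetical feature-fusion mechanism: project concatenated features.
        self.fusion = nn.Sequential(
            nn.Linear(2 * out_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, sem_tokens, vis_tokens):
        # sem_tokens: (B, N, sem_dim) semantic features of the instruction elements
        # vis_tokens: (B, N, vis_dim) visual features of the same elements
        s = self.sem_proj(sem_tokens)
        v = self.vis_proj(vis_tokens)
        return self.fusion(torch.cat([s, v], dim=-1))  # (B, N, out_dim)


def build_condition(instruction_tokens, source_image=None, canvas_size=(3, 512, 512)):
    """Shared input-output formulation: generation conditions on a blank canvas,
    editing conditions on the existing image to be modified."""
    if source_image is None:
        source_image = torch.zeros(canvas_size)  # blank canvas = subject-driven generation
    return {"instruction": instruction_tokens, "source": source_image}


if __name__ == "__main__":
    enc = MultimodalEncoder()
    sem = torch.randn(1, 16, 768)    # e.g. token-level semantic embeddings
    vis = torch.randn(1, 16, 1024)   # e.g. patch-level visual embeddings
    fused = enc(sem, vis)
    gen_cond = build_condition(fused)                                          # generation
    edit_cond = build_condition(fused, source_image=torch.rand(3, 512, 512))   # editing
    print(fused.shape, gen_cond["source"].sum().item(), edit_cond["source"].shape)
```

Because both tasks share the same condition format, a single model can be trained jointly on mixed batches of generation and editing examples, which is the basis of the cross-task enhancement and generalization claims in the abstract.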