ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement
April 2, 2025
Authors: Runhui Huang, Chunwei Wang, Junwei Yang, Guansong Lu, Yunlong Yuan, Jianhua Han, Lu Hou, Wei Zhang, Lanqing Hong, Hengshuang Zhao, Hang Xu
cs.AI
Abstract
We present ILLUME+, which leverages dual visual tokenization and a diffusion
decoder to improve both deep semantic understanding and high-fidelity image
generation. Existing unified models have struggled to handle the three
fundamental capabilities of understanding, generation, and editing
simultaneously. Models like Chameleon and EMU3 utilize VQGAN for image
discretization; because they lack deep semantic interaction, they lag behind
specialist models like LLaVA in visual understanding tasks. To mitigate this,
LaViT and ILLUME employ semantic encoders for tokenization, but they struggle
with image editing due to poor texture preservation. Meanwhile, the Janus
series decouples the input and output image representations, limiting its
ability to seamlessly handle interleaved image-text understanding and
generation. In
contrast, ILLUME+ introduces a unified dual visual tokenizer, DualViTok, which
preserves both fine-grained textures and text-aligned semantics while enabling
a coarse-to-fine image representation strategy for multimodal understanding and
generation. Additionally, we employ a diffusion model as the image detokenizer
for enhanced generation quality and efficient super-resolution. ILLUME+ follows
a continuous-input, discrete-output scheme within the unified MLLM and adopts a
progressive training procedure that supports dynamic resolution across the
vision tokenizer, MLLM, and diffusion decoder. This design allows for flexible
and efficient context-aware image editing and generation across diverse tasks.
ILLUME+ (3B) exhibits competitive performance against existing unified MLLMs
and specialized models across multimodal understanding, generation, and editing
benchmarks. With its strong performance, ILLUME+ provides a scalable and
versatile foundation for future multimodal applications. Project Page:
https://illume-unified-mllm.github.io/.
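
The abstract describes the architecture only at a high level. As a rough illustration of the dual-tokenization idea, below is a minimal PyTorch sketch of what a DualViTok-style tokenizer could look like: a semantic branch yields coarse, text-aligned tokens and a pixel branch yields fine-grained texture tokens, concatenated in coarse-to-fine order. All names here (`DualViTokSketch`, `VectorQuantizer`, the toy encoders, codebook sizes) are hypothetical stand-ins, not the paper's implementation; in ILLUME+ the discrete tokens would be decoded back to pixels by the diffusion detokenizer.

```python
# Hypothetical sketch of a DualViTok-style dual tokenizer; not the paper's code.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbor vector quantization over a learned codebook."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (B, N, D) continuous features -> (B, N) code ids + quantized features
        flat = z.reshape(-1, z.size(-1))                       # (B*N, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        idx = idx.view(z.shape[:-1])                           # (B, N)
        z_q = self.codebook(idx)                               # (B, N, D)
        # straight-through estimator: gradients bypass the argmin
        z_q = z + (z_q - z).detach()
        return idx, z_q


class DualViTokSketch(nn.Module):
    """Two branches: coarse text-aligned semantic tokens + fine texture tokens."""

    def __init__(self, semantic_encoder: nn.Module, pixel_encoder: nn.Module,
                 dim: int = 256, codebook_size: int = 8192):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # stand-in for a ViT backbone
        self.pixel_encoder = pixel_encoder        # stand-in for a CNN backbone
        self.quant_sem = VectorQuantizer(codebook_size, dim)
        self.quant_pix = VectorQuantizer(codebook_size, dim)

    def forward(self, image: torch.Tensor):
        sem_idx, sem_q = self.quant_sem(self.semantic_encoder(image))
        pix_idx, pix_q = self.quant_pix(self.pixel_encoder(image))
        # Coarse-to-fine ordering: semantic tokens first, then texture tokens.
        # The discrete ids can serve as the MLLM's output targets while the
        # continuous quantized features serve as its input, mirroring the
        # paper's continuous-input, discrete-output scheme.
        return torch.cat([sem_idx, pix_idx], 1), torch.cat([sem_q, pix_q], 1)


if __name__ == "__main__":
    class ToyEncoder(nn.Module):
        """Toy patchify-and-project encoder standing in for a real backbone."""

        def __init__(self, patch: int, dim: int = 256):
            super().__init__()
            self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

        def forward(self, x):
            return self.proj(x).flatten(2).transpose(1, 2)     # (B, N, D)

    tok = DualViTokSketch(ToyEncoder(patch=32), ToyEncoder(patch=16))
    ids, feats = tok(torch.randn(2, 3, 256, 256))
    print(ids.shape, feats.shape)  # torch.Size([2, 320]) torch.Size([2, 320, 256])
```

In this toy run the semantic branch contributes 64 coarse tokens (32-pixel patches) and the pixel branch 256 fine tokens (16-pixel patches), so a 256x256 image becomes 320 tokens; the relative granularity of the two branches is an assumption for illustration only.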