SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
December 12, 2024
Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
cs.AI
Abstract
The remarkable success of Large Language Models (LLMs) has extended to the
multimodal domain, achieving outstanding performance in image understanding and
generation. Recent efforts to develop unified Multimodal Large Language Models
(MLLMs) that integrate these capabilities have shown promising results.
However, existing approaches often involve complex designs in model
architecture or training pipeline, increasing the difficulty of model training
and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful
encoder-free MLLM capable of both image understanding and generation. To
address challenges identified in existing encoder-free unified MLLMs, we
introduce the token folding mechanism and the vision-expert-based progressive
alignment pretraining strategy, which effectively support high-resolution image
understanding while reducing training complexity. After being trained on
large-scale mixed image-text data with a unified next-token prediction
objective, SynerGen-VL achieves or surpasses the performance of existing
encoder-free unified MLLMs with comparable or smaller parameter sizes, and
narrows the gap with task-specific state-of-the-art models, highlighting a
promising path toward future unified MLLMs. Our code and models shall be
released.
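The abstract describes token folding only at a high level. A minimal sketch of the general idea, assuming "folding" means concatenating each group of adjacent image-token embeddings and projecting the result back to the model width, so that a high-resolution image occupies a proportionally shorter sequence (the function name, shapes, and projection are illustrative, not the paper's actual implementation):

```python
import numpy as np

def fold_tokens(x, W, fold=4):
    """Fold each group of `fold` adjacent image-token embeddings into one
    token: concatenate them along the feature axis, then project back to
    the model width, shortening the visual sequence by a factor of `fold`."""
    b, n, d = x.shape
    assert n % fold == 0, "sequence length must be divisible by fold"
    grouped = x.reshape(b, n // fold, d * fold)  # concat adjacent tokens
    return grouped @ W                            # project back to width d

rng = np.random.default_rng(0)
dim, fold = 256, 4
tokens = rng.standard_normal((1, 1024, dim))      # e.g. 32x32 patch tokens
W = rng.standard_normal((dim * fold, dim)) * 0.02  # hypothetical projection
folded = fold_tokens(tokens, W, fold)
print(folded.shape)  # (1, 256, 256): 4x fewer visual tokens
```

In a real model the projection would be a learned layer, but the sequence-length arithmetic is the point: folding is what makes high-resolution inputs affordable under next-token prediction.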
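The "vision experts" in the progressive alignment pretraining can likewise be pictured as modality-routed feed-forward blocks: image tokens pass through newly added vision parameters while text tokens keep using the original LLM path, so pretrained language ability is disturbed less. A toy sketch under that assumption (plain linear maps stand in for the experts; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical per-modality "experts" (plain linear maps for illustration).
W_text = rng.standard_normal((dim, dim)) * 0.02    # original LLM FFN path
W_vision = rng.standard_normal((dim, dim)) * 0.02  # newly added vision expert

def expert_ffn(x, is_image):
    """Route each token by modality: image tokens through the vision
    expert, text tokens through the original text path."""
    out = np.empty_like(x)
    out[is_image] = x[is_image] @ W_vision
    out[~is_image] = x[~is_image] @ W_text
    return out

seq = rng.standard_normal((10, dim))           # a mixed token sequence
is_image = np.array([False] * 4 + [True] * 6)  # 4 text tokens, 6 image tokens
out = expert_ffn(seq, is_image)
print(out.shape)  # (10, 64)
```

Because routing is by modality rather than by a learned gate, text tokens see exactly the original parameters, which is what lets the vision side be trained progressively without destabilizing the language side.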