SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
December 12, 2024
Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
cs.AI
Abstract
The remarkable success of Large Language Models (LLMs) has extended to the
multimodal domain, achieving outstanding performance in image understanding and
generation. Recent efforts to develop unified Multimodal Large Language Models
(MLLMs) that integrate these capabilities have shown promising results.
However, existing approaches often involve complex designs in model
architecture or training pipeline, increasing the difficulty of model training
and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful
encoder-free MLLM capable of both image understanding and generation. To
address challenges identified in existing encoder-free unified MLLMs, we
introduce the token folding mechanism and the vision-expert-based progressive
alignment pretraining strategy, which effectively support high-resolution image
understanding while reducing training complexity. After being trained on
large-scale mixed image-text data with a unified next-token prediction
objective, SynerGen-VL achieves or surpasses the performance of existing
encoder-free unified MLLMs with comparable or smaller parameter sizes, and
narrows the gap with task-specific state-of-the-art models, highlighting a
promising path toward future unified MLLMs. Our code and models shall be
released.
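The abstract names token folding as the mechanism that makes high-resolution image understanding tractable in an encoder-free MLLM, but does not spell it out. A common way to realize such folding is to concatenate each small spatial neighborhood of image-token embeddings along the channel dimension, shortening the token sequence by the fold factor squared. The sketch below is illustrative only; the function name `token_fold` and the exact memory layout are assumptions, not taken from the paper.

```python
import numpy as np

def token_fold(tokens, H, W, fold=2):
    """Fold an (H*W, C) row-major grid of image-token embeddings into a
    ((H/fold)*(W/fold), fold*fold*C) sequence: each fold x fold spatial
    neighborhood is concatenated along the channel dimension, so the
    sequence fed to the LLM is fold**2 times shorter."""
    N, C = tokens.shape
    assert N == H * W and H % fold == 0 and W % fold == 0
    # (H*W, C) -> (H/fold, fold, W/fold, fold, C): split rows and columns
    # into neighborhood indices (axes 0, 2) and within-neighborhood
    # offsets (axes 1, 3).
    x = tokens.reshape(H // fold, fold, W // fold, fold, C)
    # Bring the two neighborhood axes to the front, then flatten each
    # neighborhood's fold*fold tokens into one long channel vector.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape((H // fold) * (W // fold), fold * fold * C)

# A 4x4 grid of 3-dim tokens folds to 4 tokens of 12 dims each.
tokens = np.arange(48).reshape(16, 3)
folded = token_fold(tokens, H=4, W=4, fold=2)
print(folded.shape)  # (4, 12)
```

Under this layout, the first folded token concatenates the original tokens at grid positions (0,0), (0,1), (1,0), (1,1), preserving local spatial structure inside each folded embedding.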