
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

February 17, 2025
Authors: Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
cs.AI

Abstract
The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion, and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capability of MLLMs is typically stronger than their generative capability, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take homologous data as input to curate homologous preference data for both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using this homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
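To make the Pair-DPO idea concrete, here is a minimal sketch of how such an objective could be computed per example. This is an illustrative assumption, not the paper's implementation: it assumes Pair-DPO combines a standard DPO loss on an understanding preference pair with a DPO loss on the homologous generation preference pair via a mixing weight `alpha` (both `pair_dpo_loss` and `alpha` are hypothetical names introduced here).

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference model.
    beta: temperature on the implicit reward margin.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-margin))

def pair_dpo_loss(und_pair, gen_pair, alpha=0.5, beta=0.1):
    """Hypothetical Pair-DPO objective: a convex combination of the DPO
    losses on a homologous understanding pair and generation pair.

    und_pair / gen_pair: 4-tuples (logp_w, logp_l, ref_logp_w, ref_logp_l)
    for the understanding and generation tasks on the same homologous input.
    """
    l_und = dpo_loss(*und_pair, beta=beta)
    l_gen = dpo_loss(*gen_pair, beta=beta)
    return alpha * l_und + (1 - alpha) * l_gen
```

With zero margins on both pairs the loss reduces to log 2, and a policy that assigns higher probability to the chosen response than the reference does lowers the loss, which is the usual DPO behavior the sketch inherits.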

