JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
November 12, 2024
Authors: Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan
cs.AI
Abstract
We present JanusFlow, a powerful framework that unifies image understanding
and generation in a single model. JanusFlow introduces a minimalist
architecture that integrates autoregressive language models with rectified
flow, a state-of-the-art method in generative modeling. Our key finding
demonstrates that rectified flow can be straightforwardly trained within the
large language model framework, eliminating the need for complex architectural
modifications. To further improve the performance of our unified model, we
adopt two key strategies: (i) decoupling the understanding and generation
encoders, and (ii) aligning their representations during unified training.
Extensive experiments show that JanusFlow achieves comparable or superior
performance to specialized models in their respective domains, while
significantly outperforming existing unified approaches across standard
benchmarks. This work represents a step toward more efficient and versatile
vision-language models.
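
The abstract names rectified flow but does not spell out its objective. As a rough illustration only, the following sketch shows the generic rectified-flow training pair and Euler sampling loop in plain Python. It is not JanusFlow's implementation: the function names, the convention that t = 0 is noise and t = 1 is data, and the use of flat lists in place of image latents are all assumptions made for the example. In a unified model the velocity would be predicted by the language model; here `velocity_fn` is a hypothetical stand-in.

```python
import random


def rectified_flow_pair(x1, t, rng):
    """Build one training pair (x_t, target velocity) for a data sample x1.

    Assumed convention: t = 0 is pure noise, t = 1 is the data sample.
    """
    # Draw Gaussian noise with the same length as the data sample.
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    # Straight-line interpolation between noise and data.
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]
    # The regression target is the constant velocity along that line.
    v = [a - b for a, b in zip(x1, x0)]
    return xt, v


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [a + dt * b for a, b in zip(x, v)]
    return x
```

Training would minimize `mse(model(x_t, t), v)` over random samples and times, and generation would call `euler_sample` with the trained model as `velocity_fn`, starting from noise.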