JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

November 12, 2024
Authors: Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan
cs.AI

Abstract

We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding is that rectified flow can be trained straightforwardly within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves performance comparable or superior to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
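
The abstract's two core techniques can be made concrete with a short sketch. The PyTorch code below is a minimal, hypothetical illustration, not the authors' implementation: `llm`, `gen_enc`, and `und_enc` are assumed interfaces standing in for the language model, the generation-side encoder, and a frozen understanding encoder (e.g., a SigLIP-style ViT). It shows (i) the rectified-flow objective, which regresses the constant velocity of a straight path between noise and data, and (ii) a representation-alignment term that pulls the model's generation-side hidden states toward the understanding encoder's features.

```python
import torch
import torch.nn.functional as F

def janusflow_style_losses(llm, gen_enc, und_enc, x1, cond_tokens,
                           align_weight=0.1):
    """One training step combining a rectified-flow loss with a
    representation-alignment loss. All module interfaces here are
    assumptions for illustration only.

    x1          : clean image latents, shape (B, C, H, W)
    cond_tokens : text-condition embeddings fed to the LLM
    """
    b = x1.size(0)

    # --- Rectified flow: regress the constant velocity x1 - x0 ---
    x0 = torch.randn_like(x1)                     # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device).view(b, 1, 1, 1)
    xt = t * x1 + (1.0 - t) * x0                  # linear interpolation path
    v_target = x1 - x0                            # straight-line velocity
    # Hypothetical interface: the LLM consumes encoded noisy latents plus
    # text conditioning and returns a velocity prediction along with an
    # intermediate hidden state.
    v_pred, hidden = llm(gen_enc(xt, t), cond=cond_tokens)
    flow_loss = F.mse_loss(v_pred, v_target)

    # --- Representation alignment: match the generation-side hidden
    # states to a frozen understanding encoder's semantic features ---
    with torch.no_grad():
        sem = und_enc(x1)                         # (B, N, D) semantic features
    # Assume `hidden` has already been projected to the same (B, N, D) shape.
    align_loss = 1.0 - F.cosine_similarity(hidden, sem, dim=-1).mean()

    return flow_loss + align_weight * align_loss
```

At inference time, generation would integrate the predicted velocity from pure noise (t = 0) toward an image (t = 1) with a few Euler steps; the near-straight trajectories learned by rectified flow are what make such few-step sampling viable.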
