Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
October 17, 2024
Authors: Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
cs.AI
Abstract
In this paper, we introduce Janus, an autoregressive framework that unifies
multimodal understanding and generation. Prior research often relies on a
single visual encoder for both tasks, such as Chameleon. However, due to the
differing levels of information granularity required by multimodal
understanding and generation, this approach can lead to suboptimal performance,
particularly in multimodal understanding. To address this issue, we decouple
visual encoding into separate pathways, while still leveraging a single,
unified transformer architecture for processing. The decoupling not only
alleviates the conflict between the visual encoder's roles in understanding and
generation, but also enhances the framework's flexibility. For instance, both
the multimodal understanding and generation components can independently select
their most suitable encoding methods. Experiments show that Janus surpasses
previous unified models and matches or exceeds the performance of task-specific
models. The simplicity, high flexibility, and effectiveness of Janus make it a
strong candidate for next-generation unified multimodal models.
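To make the decoupled design concrete, the sketch below routes two independent visual pathways, a continuous encoder for understanding and a discrete codebook encoder for generation, into one shared transformer backbone. This is a minimal illustration of the idea, not the authors' implementation: the module names, toy encoders, dimensions, and codebook size are all assumptions made here for clarity.

```python
# Minimal sketch (not Janus's actual code) of decoupled visual encoding:
# two independent encoders feed a single shared transformer backbone.
import torch
import torch.nn as nn

class UnderstandingEncoder(nn.Module):
    """Stand-in for a semantic vision encoder; here just a conv patch stem."""
    def __init__(self, dim=256):
        super().__init__()
        # Collapse a 224x224 image into a short sequence of patch embeddings.
        self.patchify = nn.Conv2d(3, dim, kernel_size=32, stride=32)

    def forward(self, images):               # (B, 3, 224, 224)
        x = self.patchify(images)            # (B, dim, 7, 7)
        return x.flatten(2).transpose(1, 2)  # (B, 49, dim)

class GenerationEncoder(nn.Module):
    """Stand-in for a discrete VQ-style tokenizer on the generation pathway."""
    def __init__(self, dim=256, codebook_size=1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, token_ids):            # (B, T) integer image-token ids
        return self.codebook(token_ids)      # (B, T, dim)

class UnifiedTransformer(nn.Module):
    """Single shared transformer that processes either pathway's embeddings."""
    def __init__(self, dim=256, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, embeddings):
        return self.backbone(embeddings)

model = UnifiedTransformer()
und_enc, gen_enc = UnderstandingEncoder(), GenerationEncoder()

# Understanding pathway: raw pixels -> semantic embeddings -> shared backbone.
images = torch.randn(2, 3, 224, 224)
print(model(und_enc(images)).shape)          # torch.Size([2, 49, 256])

# Generation pathway: discrete image tokens -> codebook embeddings -> backbone.
codes = torch.randint(0, 1024, (2, 16))
print(model(gen_enc(codes)).shape)           # torch.Size([2, 16, 256])
```

Because the backbone is shared while the encoders are not, either pathway can swap in whichever encoding method suits its task without modifying the transformer, which is the flexibility the abstract highlights.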