Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Abstract

In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior approaches, such as Chameleon, often rely on a single visual encoder for both tasks. However, because multimodal understanding and generation require different levels of information granularity, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways while still using a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, the multimodal understanding and generation components can each independently select the most suitable encoding method. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
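The decoupling idea can be illustrated with a toy sketch. This is not the Janus implementation: the two encoders, the "backbone", and every name below are hypothetical stand-ins chosen only to show the routing pattern, in which each task picks its own visual encoding pathway while one shared model processes the resulting sequence.

```python
from typing import Callable, Dict, List

Sequence = List[List[float]]  # a sequence of feature vectors

def encode_for_understanding(image: List[float]) -> Sequence:
    """Hypothetical coarse pathway: one feature (the mean) per 4-pixel patch."""
    patches = [image[i:i + 4] for i in range(0, len(image), 4)]
    return [[sum(p) / len(p)] for p in patches]

def encode_for_generation(image: List[float]) -> Sequence:
    """Hypothetical fine-grained pathway: one discrete code per pixel (8 levels)."""
    return [[float(int(px * 7))] for px in image]

def shared_backbone(seq: Sequence) -> Sequence:
    """Toy stand-in for the single unified transformer: one set of weights for all tasks."""
    scale = 0.5  # the only "weight" in this sketch, shared by both pathways
    return [[scale * x for x in vec] for vec in seq]

# Each task selects its own encoder independently (assumed registry, for illustration).
ENCODERS: Dict[str, Callable[[List[float]], Sequence]] = {
    "understanding": encode_for_understanding,
    "generation": encode_for_generation,
}

def run(task: str, image: List[float]) -> Sequence:
    """Route the image through the task-specific encoder, then the shared backbone."""
    return shared_backbone(ENCODERS[task](image))

image = [0.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.0, 0.0]  # toy 8-pixel "image" in [0, 1]
understanding_out = run("understanding", image)
generation_out = run("generation", image)
```

In this sketch, swapping either encoder requires changing only its entry in `ENCODERS`; the shared backbone is untouched, which is the kind of flexibility the abstract attributes to decoupling.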
