JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Abstract

We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding is that rectified flow can be trained straightforwardly within the large language model framework, without complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
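For readers unfamiliar with rectified flow, the following is a minimal sketch of the generic recipe, not JanusFlow's actual implementation. It assumes the common convention that Gaussian noise sits at t = 0 and data at t = 1, and that a model regresses the constant velocity of the straight path between them. All function names here are hypothetical, and `predict_velocity` is a stand-in for whatever network supplies the velocity (in JanusFlow, the language model with its generation encoder and decoder).

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Constant velocity of that straight path: d x_t / dt = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample Monte Carlo estimate of the velocity regression loss."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)         # hypothetical model call
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = predict_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

As a sanity check under these assumptions, a dataset with a single point `x1` has the closed-form velocity field `v(x, t) = (x1 - x) / (1 - t)`. Passing that field as `predict_velocity` drives the loss to zero (up to rounding), and `euler_sample` then carries any starting noise to `x1`.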
