Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Abstract

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learned while a conditioning mechanism is added to the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying variational encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer without cross-attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state of the art on various cross-modal and intra-modal mapping tasks, namely image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
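To make the idea of a noise-free source distribution concrete, the following is a minimal sketch of one flow matching training step in which the source sample comes from an encoded text input rather than from Gaussian noise. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which probability path CrossFlow uses, so a straight-line path (a common flow matching choice) is assumed; the function names, the latent size, and the random arrays standing in for the text encoder and image latents are all hypothetical; and the source and target latents are assumed to share the same shape.

```python
import numpy as np


def sample_source_latent(mu, log_var, rng):
    """Draw z0 ~ N(mu, diag(exp(log_var))) with the reparameterisation trick.

    This stands in for a variational encoder applied to the input modality.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def flow_matching_pair(z0, z1, t):
    """Return the point z_t on the straight path from z0 to z1 and the
    target velocity for that path (assumed linear interpolation)."""
    t = t.reshape(-1, *([1] * (z0.ndim - 1)))  # broadcast t over feature dims
    z_t = (1.0 - t) * z0 + t * z1
    v_target = z1 - z0  # constant velocity along a straight line
    return z_t, v_target


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))


rng = np.random.default_rng(0)
batch, dim = 4, 8  # hypothetical sizes

# Random stand-ins for real encoder outputs (assumptions, not real data).
text_mu = rng.standard_normal((batch, dim))
text_log_var = np.full((batch, dim), -2.0)
image_latent = rng.standard_normal((batch, dim))

z0 = sample_source_latent(text_mu, text_log_var, rng)  # source: text latent
t = rng.uniform(size=batch)                             # one time per sample
z_t, v_target = flow_matching_pair(z0, image_latent, t)

# A real model would predict the velocity from (z_t, t) alone, with no
# separate conditioning input; zeros are used here as a placeholder.
v_pred = np.zeros_like(v_target)
loss = flow_matching_loss(v_pred, v_target)
```

In this sketch the text information enters only through the starting point `z0`, which is why no cross-attention or other conditioning input appears in the velocity prediction step.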
