Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Abstract

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. The conventional approach is to learn a complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, the same noise-to-image mapping is learned, with a conditioning mechanism added to the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not require the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer without cross-attention slightly outperforms standard flow matching. We also show that it scales better with training steps and model size, and that it allows latent arithmetic that produces semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we show that CrossFlow is on par with or outperforms the state of the art on various cross-modal and intra-modal mapping tasks, namely image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
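The abstract does not spell out the training objective, so the following is only a minimal sketch of what "a direct mapping from one distribution to another" can look like, assuming the common linear-interpolation form of flow matching. The names `interpolate`, `flow_matching_loss`, `euler_sample`, and `predict_velocity` are hypothetical and are not taken from the paper. Here `x0` stands in for samples from the source modality (in CrossFlow, per the abstract, this would be the output of a variational encoder applied to the input rather than Gaussian noise) and `x1` for paired samples from the target modality.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from source x0 to target x1 at time t in [0, 1]."""
    # Reshape t from (batch,) to (batch, 1, ..., 1) so it broadcasts over features.
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, *([1] * (x0.ndim - 1)))
    return (1.0 - t) * x0 + t * x1

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted velocity and the path velocity x1 - x0."""
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)  # assumed signature: (x_t, t) -> velocity
    return float(np.mean((v_pred - (x1 - x0)) ** 2))

def euler_sample(predict_velocity, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt, dtype=x.dtype)
        x = x + dt * predict_velocity(x, t)
    return x
```

Note that nothing in the loss or the sampler requires `x0` to be noise: the only change relative to a standard noise-to-data setup is where `x0` comes from, which is the property of flow matching the abstract highlights. A real model would replace `predict_velocity` with a trained network.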
