Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
December 19, 2024
Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh
cs.AI
Abstract
Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst a conditioning mechanism is included in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it does not constrain the source distribution to be noise. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present CrossFlow, a general and simple framework for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable Classifier-free guidance. Surprisingly, for text-to-image generation, CrossFlow with a vanilla transformer and no cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow matches or outperforms the state of the art on various cross-modal and intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
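Likewise, here is a hedged sketch (not from the paper) of sampling and of the latent arithmetic the abstract mentions. Because text latents form the source distribution, edits made in that latent space propagate to the image through the same learned ODE; `velocity_net` and `encode` are the hypothetical modules from the sketch above, and the Euler solver is just one simple choice of integrator.

```python
import torch

@torch.no_grad()
def sample(velocity_net, x0, steps=50):
    """Integrate dx/dt = v(x, t) from t=0 (text latent) to t=1
    (image latent) with a plain Euler solver."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + velocity_net(x, t) * dt
    return x

# Illustrative latent arithmetic (hypothetical prompts): start from the
# latent of one prompt, shifted by the difference of two others, then
# decode by integrating the flow.
#   z = encode("a photo of a cat") + encode("winter") - encode("summer")
#   image_latent = sample(velocity_net, z)
```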