Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

December 19, 2024
Authors: Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, Mannat Singh
cs.AI

Abstract

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it is not constrained to use noise as the source distribution. Hence, in this paper, we propose a paradigm shift and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying variational encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross-attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic that results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state of the art on various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
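The core idea described above, flowing directly from the text-latent distribution to the image-latent distribution rather than from Gaussian noise, can be illustrated with a rectified-flow style training step. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: `model`, `text_encoder`, and `image_encoder` are hypothetical modules, the text latent is assumed to share the image latent's shape, and the linear interpolation path and KL weight are assumptions.

```python
# Minimal sketch of cross-modal flow matching in the spirit of CrossFlow.
# Assumptions: a variational text encoder returning (mu, logvar) with the same
# shape as the image latent, a linear (rectified-flow) interpolation path, and
# a velocity-prediction network `model`. All module names are hypothetical.
import torch


def crossflow_training_step(model, text_encoder, image_encoder, text, image):
    # Encode the text variationally; sampling regularizes the source distribution.
    mu, logvar = text_encoder(text)
    z_text = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    # Encode the image into a latent of the same shape as z_text.
    z_img = image_encoder(image)

    # Sample an interpolation time t in [0, 1] per example and broadcast it.
    t = torch.rand(z_img.shape[0], device=z_img.device)
    t_b = t.view(-1, *([1] * (z_img.dim() - 1)))

    # Linear path from the text latent (source) to the image latent (target);
    # no noise source and no cross-attention conditioning are involved.
    z_t = (1.0 - t_b) * z_text + t_b * z_img
    target_velocity = z_img - z_text

    # Train the model to predict the velocity field along the path.
    pred_velocity = model(z_t, t)
    flow_loss = torch.mean((pred_velocity - target_velocity) ** 2)

    # A small KL term keeps the variational text latent well-behaved
    # (the 1e-3 weight is an assumed hyperparameter).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return flow_loss + 1e-3 * kl
```

At sampling time, one would draw a latent from the variational text encoder and integrate the learned velocity field from t = 0 to t = 1 (e.g., with a simple Euler loop) to obtain the image latent, which is then decoded to pixels.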
