OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
December 2, 2024
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover
cs.AI
Abstract
We introduce OmniFlow, a novel generative model designed for any-to-any
generation tasks such as text-to-image, text-to-audio, and audio-to-image
synthesis. OmniFlow advances the rectified flow (RF) framework used in
text-to-image models to handle the joint distribution of multiple modalities.
It outperforms previous any-to-any models on a wide range of tasks, such as
text-to-image and text-to-audio synthesis. Our work offers three key
contributions: First, we extend RF to a multi-modal setting and introduce a
novel guidance mechanism, enabling users to flexibly control the alignment
between different modalities in the generated outputs. Second, we propose a
novel architecture that extends the text-to-image MMDiT architecture of Stable
Diffusion 3 and enables audio and text generation. The extended modules can be
efficiently pretrained individually and merged with the vanilla text-to-image
MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design
choices of rectified flow transformers for large-scale audio and text
generation, providing valuable insights into optimizing performance across
diverse modalities. The code will be available at
https://github.com/jacklishufan/OmniFlows.
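
The abstract describes the approach only at a high level, so as a concrete reference point, below is a minimal PyTorch sketch of how the standard rectified flow objective could extend to a joint multi-modal setting: each modality's latent is noised along its own straight-line path, and a joint model regresses all velocities at once. The `rf_interpolate` and `multimodal_rf_loss` helpers, the independent per-modality timesteps, and the model interface are illustrative assumptions, not OmniFlow's actual code.

```python
# Hypothetical sketch of a multi-modal rectified flow (RF) training step.
# Standard RF draws noise x0 ~ N(0, I), interpolates x_t = (1 - t) * x0 + t * x1
# toward data x1, and regresses the model onto the velocity (x1 - x0).
# Giving each modality its own timestep is an assumption based on the
# abstract's description, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rf_interpolate(x1: torch.Tensor, t: torch.Tensor):
    """Straight-line path from Gaussian noise x0 to data x1 at time t."""
    x0 = torch.randn_like(x1)
    t = t.view(-1, *([1] * (x1.dim() - 1)))  # broadcast t over latent dims
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0  # noisy latent and the RF velocity target

def multimodal_rf_loss(model: nn.Module, latents: dict) -> torch.Tensor:
    """Velocity-matching loss summed over modalities (e.g. text/image/audio).

    `model` is a hypothetical joint transformer that takes the noisy latents
    of all modalities plus their timesteps and predicts one velocity field
    per modality, mirroring an MMDiT-style joint-attention design.
    """
    batch = next(iter(latents.values())).shape[0]
    noisy, targets, timesteps = {}, {}, {}
    for name, x1 in latents.items():
        t = torch.rand(batch, device=x1.device)  # independent noise level
        noisy[name], targets[name] = rf_interpolate(x1, t)
        timesteps[name] = t
    preds = model(noisy, timesteps)  # assumed interface: dict in, dict out
    return sum(F.mse_loss(preds[name], targets[name]) for name in latents)
```

Sampling would then integrate the predicted velocity field from t = 0 to t = 1 (e.g., with an Euler solver), holding any conditioning modalities fixed at their clean latents; the guidance mechanism mentioned in the abstract would plausibly reweight per-modality contributions, but those details are left to the paper.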