OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

December 2, 2024
Authors: Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover
cs.AI

Abstract

We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
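
To make the first contribution concrete, below is a minimal PyTorch sketch of a rectified-flow training step extended from one modality to three, as the abstract describes. Everything here (the toy network, latent dimensions, and the choice of an independent timestep per modality) is an illustrative assumption, not OmniFlow's actual implementation.

# A minimal, hypothetical sketch (not the authors' code) of a multi-modal
# rectified-flow objective: each modality's latent is interpolated along a
# straight line between data and Gaussian noise, and a joint network
# predicts a velocity for every modality at once.
import torch
import torch.nn as nn

class ToyJointVelocityNet(nn.Module):
    """Stand-in for the extended MMDiT: one velocity head per modality."""
    def __init__(self, d_img=16, d_aud=8, d_txt=32):
        super().__init__()
        self.img_head = nn.Linear(d_img + 1, d_img)
        self.aud_head = nn.Linear(d_aud + 1, d_aud)
        self.txt_head = nn.Linear(d_txt + 1, d_txt)

    def forward(self, z_img, t_img, z_aud, t_aud, z_txt, t_txt):
        # Condition each head on its own timestep via simple concatenation.
        return (self.img_head(torch.cat([z_img, t_img], dim=-1)),
                self.aud_head(torch.cat([z_aud, t_aud], dim=-1)),
                self.txt_head(torch.cat([z_txt, t_txt], dim=-1)))

def multimodal_rf_loss(model, x_img, x_aud, x_txt):
    """One rectified-flow training step over the joint distribution."""
    b = x_img.shape[0]
    zs, ts, targets = [], [], []
    for x in (x_img, x_aud, x_txt):
        t = torch.rand(b, 1)                 # independent timestep per modality
        noise = torch.randn_like(x)
        zs.append((1 - t) * x + t * noise)   # straight-line interpolant z_t
        ts.append(t)
        targets.append(noise - x)            # RF velocity target dz_t/dt
    preds = model(zs[0], ts[0], zs[1], ts[1], zs[2], ts[2])
    return sum(((p - v) ** 2).mean() for p, v in zip(preds, targets))

# Usage with random stand-in latents:
model = ToyJointVelocityNet()
loss = multimodal_rf_loss(model, torch.randn(4, 16), torch.randn(4, 8), torch.randn(4, 32))
loss.backward()

The guidance mechanism mentioned in the abstract would act at sampling time, weighting how strongly each modality's predicted velocity is conditioned on the others; that component is not sketched here.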
