

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

September 30, 2024
Authors: Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, Jingren Zhou
cs.AI

Abstract

Diffusion models have emerged as a powerful generative technology and are applicable across a wide range of scenarios. Most existing foundational diffusion models are primarily designed for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents these foundational diffusion models from serving as a unified model in the field of visual generation, in the way GPT-4 does in natural language processing. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models across a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed Long-context Condition Unit (LCU), and propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the lack of available training data. It involves acquiring paired images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated paired data across a variety of visual generation tasks. Extensive experimental results demonstrate the superiority of our model in the field of visual generation. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation using a single model as the backend, avoiding the cumbersome pipeline typically employed in visual agents. Code and models will be available on the project page: https://ali-vilab.github.io/ace-page/.
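
The abstract does not specify how a Long-context Condition Unit is laid out, so the following is only a minimal, hypothetical sketch of how an LCU-style input bundle could be represented: one textual instruction per turn together with optional image conditions, collected into a long-context sequence covering a multi-turn creation or editing session. All names and fields below are illustrative assumptions, not the paper's actual format.

```python
# Hypothetical sketch of an LCU-style input bundle (illustrative only;
# the paper's real data format is not described in the abstract).
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LCU:
    """One conditioning unit: a text instruction plus optional image conditions."""
    instruction: str                                             # creation/editing instruction
    condition_images: List[str] = field(default_factory=list)   # paths to reference/source images
    masks: List[Optional[str]] = field(default_factory=list)    # optional per-image region masks


@dataclass
class LCUSequence:
    """A long-context sequence of units, e.g. a multi-turn editing history."""
    units: List[LCU] = field(default_factory=list)

    def add_turn(self, instruction: str, images: Optional[List[str]] = None) -> None:
        """Append one instruction turn, optionally conditioned on earlier images."""
        self.units.append(LCU(instruction=instruction, condition_images=images or []))


# Example: a two-turn session expressed as a single long-context input.
history = LCUSequence()
history.add_turn("Generate a photo of a cabin in a snowy forest.")
history.add_turn("Add warm light in the windows.", images=["cabin_v1.png"])
```

In this reading, a single backend model consumes the whole sequence at once, which is what would let one model serve every interactive image-creation request in the chat-system scenario the abstract describes.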
