
ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

December 17, 2024
Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
cs.AI

Abstract

Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building upon this foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive visual generation framework that leverages pretrained diffusion transformers in their original form, requiring no additional tuning, adapters, or modifications. Users can interact with ChatDiT to create interleaved text-image articles, multi-page picture books, edit images, design IP derivatives, or develop character design settings, all through free-form natural language across one or more conversational rounds. At its core, ChatDiT employs a multi-agent system comprising three key components: an Instruction-Parsing agent that interprets user-uploaded images and instructions, a Strategy-Planning agent that devises single-step or multi-step generation actions, and an Execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench arXiv:2412.11767, comprising 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in zero-shot adaptation to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at https://github.com/ali-vilab/ChatDiT.
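The three-agent pipeline described in the abstract can be sketched as follows. This is a hypothetical illustration only: the class and function names (`ParsedRequest`, `GenerationStep`, `plan_strategy`, etc.) are invented here, and the actual agents in the released code (https://github.com/ali-vilab/ChatDiT) are LLM-driven rather than the trivial stand-ins below.

```python
from dataclasses import dataclass, field

@dataclass
class ParsedRequest:
    images: list          # user-uploaded reference images
    instruction: str      # free-form natural-language request

@dataclass
class GenerationStep:
    prompt: str           # text condition for one in-context generation call
    reference_images: list = field(default_factory=list)
    num_outputs: int = 1

def parse_instruction(user_text, user_images):
    """Instruction-Parsing agent: interpret uploaded images and instructions."""
    return ParsedRequest(images=list(user_images), instruction=user_text)

def plan_strategy(request):
    """Strategy-Planning agent: devise single-step or multi-step actions."""
    # A trivial single-step plan; per the abstract, the real agent can emit
    # multi-step plans (e.g. one step per page of a picture book).
    return [GenerationStep(prompt=request.instruction,
                           reference_images=request.images)]

def execute(steps, toolkit):
    """Execution agent: run each step with the in-context DiT toolkit."""
    outputs = []
    for step in steps:
        outputs.extend(toolkit(step.prompt, step.reference_images,
                               step.num_outputs))
    return outputs

# Stand-in for the pretrained diffusion-transformer toolkit.
def dummy_toolkit(prompt, refs, n):
    return [f"image<{prompt!r}, {len(refs)} refs>" for _ in range(n)]

request = parse_instruction("Draw a one-page picture of a cat", [])
plan = plan_strategy(request)
results = execute(plan, dummy_toolkit)
print(len(results))  # 1
```

The key design point the sketch mirrors is that the diffusion transformer itself is untouched: all task adaptation lives in how the agents parse, plan, and condition the in-context generation calls.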
