ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
December 17, 2024
作者: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
cs.AI
Abstract
Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the
inherent in-context generation capabilities of pretrained diffusion
transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks
with minimal or no architectural modifications. These capabilities are unlocked
by concatenating self-attention tokens across multiple input and target images,
combined with grouped and masked generation pipelines. Building upon this
foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive
visual generation framework that leverages pretrained diffusion transformers in
their original form, requiring no additional tuning, adapters, or
modifications. Users can interact with ChatDiT to create interleaved text-image
articles, multi-page picture books, edit images, design IP derivatives, or
develop character design settings, all through free-form natural language
across one or more conversational rounds. At its core, ChatDiT employs a
multi-agent system comprising three key components: an Instruction-Parsing
agent that interprets user-uploaded images and instructions, a
Strategy-Planning agent that devises single-step or multi-step generation
actions, and an Execution agent that performs these actions using an in-context
toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench
arXiv:2412.11767, comprising 100 real-world design tasks and 275 cases with
diverse instructions and varying numbers of input and target images. Despite
its simplicity and training-free approach, ChatDiT surpasses all competitors,
including those specifically designed and trained on extensive multi-task
datasets. We further identify key limitations of pretrained DiTs in zero-shot adaptation to tasks. We release all code, agents, results, and intermediate outputs to facilitate further research at https://github.com/ali-vilab/ChatDiT.
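The three-agent flow described in the abstract (Instruction-Parsing, Strategy-Planning, Execution) can be sketched roughly as below. This is a minimal illustrative mock-up, not the actual ChatDiT API: all function names, the `ParsedRequest` structure, and the keyword-based planning heuristic are assumptions, and the diffusion-transformer call is stubbed out.

```python
# Hypothetical sketch of ChatDiT's three-agent pipeline.
# Names and heuristics are illustrative; the real system uses LLM agents
# and an in-context DiT generation toolkit (see the GitHub repository).

from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ParsedRequest:
    instruction: str
    input_images: List[str]
    num_target_images: int


def instruction_parsing_agent(user_text: str,
                              uploaded_images: Sequence[str]) -> ParsedRequest:
    """Interpret the user's free-form request and any uploaded images."""
    # A real implementation would query an LLM; here a toy heuristic decides
    # whether the request needs one output or a multi-page sequence.
    n_targets = 4 if "picture book" in user_text.lower() else 1
    return ParsedRequest(user_text, list(uploaded_images), n_targets)


def strategy_planning_agent(request: ParsedRequest) -> List[str]:
    """Devise single-step or multi-step in-context generation actions."""
    return [
        f"step {i + 1}: generate target conditioned on "
        f"{len(request.input_images)} input image(s)"
        for i in range(request.num_target_images)
    ]


def execution_agent(actions: List[str]) -> List[str]:
    """Execute each planned action with the DiT toolkit (stubbed)."""
    # In ChatDiT this would run grouped/masked in-context generation;
    # here we just return placeholder identifiers.
    return [f"image<{action}>" for action in actions]


def chatdit_round(user_text: str,
                  uploaded_images: Sequence[str] = ()) -> List[str]:
    """One conversational round: parse -> plan -> execute."""
    request = instruction_parsing_agent(user_text, uploaded_images)
    plan = strategy_planning_agent(request)
    return execution_agent(plan)


if __name__ == "__main__":
    outputs = chatdit_round("Create a picture book about a fox")
    print(len(outputs))  # 4: one generation step per planned page
```

The point of the decomposition is that free-form chat is reduced to a sequence of discrete in-context generation actions, each of which the frozen DiT can execute without any fine-tuning.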