ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
December 17, 2024
作者: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
cs.AI
Abstract
Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the
inherent in-context generation capabilities of pretrained diffusion
transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks
with minimal or no architectural modifications. These capabilities are unlocked
by concatenating self-attention tokens across multiple input and target images,
combined with grouped and masked generation pipelines. Building upon this
foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive
visual generation framework that leverages pretrained diffusion transformers in
their original form, requiring no additional tuning, adapters, or
modifications. Users can interact with ChatDiT to create interleaved text-image
articles, multi-page picture books, edit images, design IP derivatives, or
develop character design settings, all through free-form natural language
across one or more conversational rounds. At its core, ChatDiT employs a
multi-agent system comprising three key components: an Instruction-Parsing
agent that interprets user-uploaded images and instructions, a
Strategy-Planning agent that devises single-step or multi-step generation
actions, and an Execution agent that performs these actions using an in-context
toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench
arXiv:2412.11767, comprising 100 real-world design tasks and 275 cases with
diverse instructions and varying numbers of input and target images. Despite
its simplicity and training-free approach, ChatDiT surpasses all competitors,
including those specifically designed and trained on extensive multi-task
datasets. We further identify key limitations of pretrained DiTs in zero-shot
adaptation to tasks. We release all code, agents, results, and intermediate
outputs to facilitate further research at https://github.com/ali-vilab/ChatDiT.
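The three-agent flow described in the abstract (parse the user's uploads and instruction, plan generation actions, then execute them with the in-context DiT toolkit) can be sketched as follows. This is a minimal illustrative outline, not the actual ChatDiT API: all class and function names here are assumptions, and the agent bodies are placeholders for the LLM and diffusion-transformer calls a real system would make.

```python
# Illustrative sketch of ChatDiT's three-agent pipeline (hypothetical names).
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ParsedRequest:
    instruction: str          # free-form natural-language instruction
    input_images: List[str]   # ids/paths of user-uploaded images
    target_count: int = 1     # number of requested output images

def parse_instruction(user_text: str, images: List[str]) -> ParsedRequest:
    """Instruction-Parsing agent: interpret uploads and instructions."""
    # A real agent would call an LLM; here we simply wrap the input.
    return ParsedRequest(instruction=user_text, input_images=images)

def plan_strategy(req: ParsedRequest) -> List[Dict]:
    """Strategy-Planning agent: devise single- or multi-step actions."""
    # Simplified: one generation action per requested target image.
    return [{"step": i, "prompt": req.instruction, "refs": req.input_images}
            for i in range(req.target_count)]

def execute(actions: List[Dict]) -> List[str]:
    """Execution agent: run each action with the in-context DiT toolkit."""
    # Placeholder: a real system would invoke the pretrained diffusion
    # transformer, concatenating self-attention tokens across the
    # reference and target images for in-context generation.
    return [f"image_for_step_{a['step']}" for a in actions]

req = parse_instruction("Design a picture-book page", ["cover.png"])
outputs = execute(plan_strategy(req))
```

Because the framework is training-free, each agent only orchestrates calls to the frozen pretrained DiT; no adapter or fine-tuning step appears anywhere in the loop.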