ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers
December 17, 2024
作者: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Chen Liang, Tong Shen, Han Zhang, Huanzhang Dou, Yu Liu, Jingren Zhou
cs.AI
Abstract
Recent research arXiv:2410.15027 arXiv:2410.23775 has highlighted the
inherent in-context generation capabilities of pretrained diffusion
transformers (DiTs), enabling them to seamlessly adapt to diverse visual tasks
with minimal or no architectural modifications. These capabilities are unlocked
by concatenating self-attention tokens across multiple input and target images,
combined with grouped and masked generation pipelines. Building upon this
foundation, we present ChatDiT, a zero-shot, general-purpose, and interactive
visual generation framework that leverages pretrained diffusion transformers in
their original form, requiring no additional tuning, adapters, or
modifications. Users can interact with ChatDiT to create interleaved text-image
articles, multi-page picture books, edit images, design IP derivatives, or
develop character design settings, all through free-form natural language
across one or more conversational rounds. At its core, ChatDiT employs a
multi-agent system comprising three key components: an Instruction-Parsing
agent that interprets user-uploaded images and instructions, a
Strategy-Planning agent that devises single-step or multi-step generation
actions, and an Execution agent that performs these actions using an in-context
toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench
arXiv:2412.11767, comprising 100 real-world design tasks and 275 cases with
diverse instructions and varying numbers of input and target images. Despite
its simplicity and training-free approach, ChatDiT surpasses all competitors,
including those specifically designed and trained on extensive multi-task
datasets. We further identify key limitations of pretrained DiTs in zero-shot
adaptation to tasks. We release all code, agents, results, and intermediate
outputs to facilitate further research at https://github.com/ali-vilab/ChatDiT.
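The three-agent flow described in the abstract (parse the user's uploads and instruction, plan generation actions, then execute them with the in-context DiT toolkit) can be sketched as follows. This is a minimal illustrative outline, not the actual ChatDiT API: all class and function names here are assumptions, and the agent bodies are placeholders for the LLM and diffusion-transformer calls a real system would make.

```python
# Illustrative sketch of ChatDiT's three-agent pipeline (hypothetical names).
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ParsedRequest:
    instruction: str          # free-form natural-language instruction
    input_images: List[str]   # ids/paths of user-uploaded images
    target_count: int = 1     # number of requested output images

def parse_instruction(user_text: str, images: List[str]) -> ParsedRequest:
    """Instruction-Parsing agent: interpret uploads and instructions."""
    # A real agent would call an LLM; here we simply wrap the input.
    return ParsedRequest(instruction=user_text, input_images=images)

def plan_strategy(req: ParsedRequest) -> List[Dict]:
    """Strategy-Planning agent: devise single- or multi-step actions."""
    # Simplified: one generation action per requested target image.
    return [{"step": i, "prompt": req.instruction, "refs": req.input_images}
            for i in range(req.target_count)]

def execute(actions: List[Dict]) -> List[str]:
    """Execution agent: run each action with the in-context DiT toolkit."""
    # Placeholder: a real system would invoke the pretrained diffusion
    # transformer, concatenating self-attention tokens across the
    # reference and target images for in-context generation.
    return [f"image_for_step_{a['step']}" for a in actions]

req = parse_instruction("Design a picture-book page", ["cover.png"])
outputs = execute(plan_strategy(req))
```

Because the framework is training-free, each agent only orchestrates calls to the frozen pretrained DiT; no adapter or fine-tuning step appears anywhere in the loop.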