ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

Abstract

Recent research (arXiv:2410.15027, arXiv:2410.23775) has highlighted the inherent in-context generation capabilities of pretrained diffusion transformers (DiTs), which allow them to adapt to diverse visual tasks with minimal or no architectural modifications. These capabilities are unlocked by concatenating self-attention tokens across multiple input and target images, combined with grouped and masked generation pipelines. Building on this foundation, we present ChatDiT, a zero-shot, general-purpose, interactive visual generation framework that uses pretrained diffusion transformers in their original form, with no additional tuning, adapters, or modifications. Through free-form natural language across one or more conversational rounds, users can interact with ChatDiT to create interleaved text-image articles and multi-page picture books, edit images, design IP derivatives, or develop character design settings. At its core, ChatDiT employs a multi-agent system with three key components: an Instruction-Parsing agent that interprets user-uploaded images and instructions, a Strategy-Planning agent that devises single-step or multi-step generation actions, and an Execution agent that performs these actions using an in-context toolkit of diffusion transformers. We thoroughly evaluate ChatDiT on IDEA-Bench (arXiv:2412.11767), which comprises 100 real-world design tasks and 275 cases with diverse instructions and varying numbers of input and target images. Despite its simplicity and training-free approach, ChatDiT surpasses all competitors, including those specifically designed and trained on extensive multi-task datasets. We further identify key limitations of pretrained DiTs in adapting to tasks zero-shot. We release all code, agents, results, and intermediate outputs to facilitate further research at https://github.com/ali-vilab/ChatDiT
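The three-agent flow described above (parse the request, plan one or more generation steps, execute them with a toolkit) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the released ChatDiT code: every name here (`ParsedRequest`, `Step`, `parse_instruction`, `plan_strategy`, `execute_plan`, the tool keys) is hypothetical, the planning rule is invented for the example, and the tools are stubs that return strings instead of calling a diffusion transformer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ParsedRequest:
    """Hypothetical output of the instruction-parsing stage."""
    instruction: str
    input_images: List[str]   # placeholder identifiers for uploaded images
    num_targets: int          # how many images the user asked for


@dataclass
class Step:
    """One generation action in a plan (hypothetical structure)."""
    tool: str                 # key into the toolkit dictionary
    prompt: str
    references: List[str] = field(default_factory=list)


# A tool takes a prompt plus reference images and returns one output identifier.
Tool = Callable[[str, List[str]], str]


def parse_instruction(message: str, uploads: List[str], num_targets: int = 1) -> ParsedRequest:
    # A real parser would likely rely on a language model; this stub only
    # normalizes the inputs into a structured request.
    return ParsedRequest(message.strip(), list(uploads), num_targets)


def plan_strategy(request: ParsedRequest) -> List[Step]:
    # Assumed rule for illustration: use an image-conditioned tool when the
    # user uploaded images, otherwise a text-only tool; one step per target.
    tool = "image_to_image" if request.input_images else "text_to_image"
    n = request.num_targets
    return [
        Step(tool, f"{request.instruction} (image {i + 1}/{n})", list(request.input_images))
        for i in range(n)
    ]


def execute_plan(plan: List[Step], toolkit: Dict[str, Tool]) -> List[str]:
    # Multi-step execution: each step also sees the outputs produced so far,
    # so later images can stay consistent with earlier ones.
    outputs: List[str] = []
    for step in plan:
        outputs.append(toolkit[step.tool](step.prompt, step.references + outputs))
    return outputs


def make_stub_toolkit() -> Dict[str, Tool]:
    # Stand-ins for calls into a pretrained DiT; they only describe the call.
    def text_to_image(prompt: str, refs: List[str]) -> str:
        return f"generated<{prompt}>"

    def image_to_image(prompt: str, refs: List[str]) -> str:
        return f"generated<{prompt} | refs={len(refs)}>"

    return {"text_to_image": text_to_image, "image_to_image": image_to_image}


def chat_round(message: str, uploads: List[str], num_targets: int = 1) -> List[str]:
    request = parse_instruction(message, uploads, num_targets)
    plan = plan_strategy(request)
    return execute_plan(plan, make_stub_toolkit())
```

Calling `chat_round("restyle this photo", ["img0"], num_targets=2)` runs two steps, and the second step receives both the upload and the first output as references. Passing earlier outputs forward is one simple way to model a multi-step plan; the actual framework may structure this differently.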
