MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation
February 3, 2025
Authors: Yiren Song, Cheng Liu, Mike Zheng Shou
cs.AI
Abstract
A hallmark of human intelligence is the ability to create complex artifacts
through structured multi-step processes. Generating procedural tutorials with
AI is a longstanding but challenging goal, facing three key obstacles: (1)
scarcity of multi-task procedural datasets, (2) maintaining logical continuity
and visual consistency between steps, and (3) generalizing across multiple
domains. To address these challenges, we propose a multi-domain dataset
covering 21 tasks with over 24,000 procedural sequences. Building upon this
foundation, we introduce MakeAnything, a framework based on the diffusion
transformer (DiT), which leverages fine-tuning to activate the in-context
capabilities of DiT for generating consistent procedural sequences. We
introduce asymmetric low-rank adaptation (LoRA) for image generation, which
balances generalization capabilities and task-specific performance by freezing
encoder parameters while adaptively tuning decoder layers. Additionally, our
ReCraft model enables image-to-process generation through spatiotemporal
consistency constraints, allowing static images to be decomposed into plausible
creation sequences. Extensive experiments demonstrate that MakeAnything
surpasses existing methods, setting new performance benchmarks for procedural
generation tasks.
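
Two of the ideas above can be made concrete with short sketches. First, the in-context fine-tuning: one plausible way to let a text-to-image DiT generate a whole procedural sequence at once is to tile the ordered step frames into a single composite training image. The sketch below illustrates that data preparation; the 3x3 layout, cell size, and file names are illustrative assumptions, not the paper's stated pipeline.

```python
# Tile the ordered step frames of a procedural sequence into a single
# composite image, so the DiT learns to generate every step of the
# process jointly on one canvas. Grid size and cell size are assumed.
from PIL import Image

def tile_steps(frames: list, grid: int = 3, cell: int = 336) -> Image.Image:
    """Paste up to grid*grid step frames row-major onto a square canvas."""
    canvas = Image.new("RGB", (grid * cell, grid * cell), "white")
    for i, frame in enumerate(frames[: grid * grid]):
        row, col = divmod(i, grid)
        canvas.paste(frame.resize((cell, cell)), (col * cell, row * cell))
    return canvas

# Usage: nine intermediate snapshots of a painting, first step to last.
# steps = [Image.open(f"step_{i}.png") for i in range(1, 10)]
# composite = tile_steps(steps)  # paired with a text caption for fine-tuning
```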
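Second, the asymmetric LoRA: the abstract describes freezing encoder parameters while adaptively tuning decoder layers. The minimal PyTorch sketch below captures that asymmetry, freezing the entire backbone and attaching trainable low-rank adapters only to decoder-side projections. It is not the MakeAnything implementation, and the module-name filters (`decoder`, `to_q`/`to_k`/`to_v`) are assumptions about the backbone's naming.

```python
# Sketch of asymmetric LoRA: backbone fully frozen, trainable low-rank
# adapters attached only to decoder-side attention projections.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (B @ A) x."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

def apply_asymmetric_lora(model: nn.Module, rank: int = 16) -> nn.Module:
    # Freeze every backbone parameter, encoder and decoder alike.
    for p in model.parameters():
        p.requires_grad = False
    # Collect targets first so the module tree is not mutated mid-iteration:
    # only attention projections under a decoder block receive adapters.
    targets = []
    for parent_name, parent in model.named_modules():
        for child_name, child in parent.named_children():
            if (isinstance(child, nn.Linear)
                    and "decoder" in parent_name            # assumed naming
                    and child_name in ("to_q", "to_k", "to_v")):
                targets.append((parent, child_name, child))
    for parent, child_name, child in targets:
        setattr(parent, child_name, LoRALinear(child, rank=rank))
    return model

# After this, only the decoder-side LoRA parameters remain trainable:
# trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```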