GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
December 5, 2024
作者: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu
cs.AI
Abstract
Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent; multiple agents can then collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative multi-agent framework that enables compositional text-to-video generation. The collaborative workflow consists of three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination by a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to handle the diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism that adaptively selects the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, which achieves state-of-the-art performance in compositional text-to-video generation.
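To make the described workflow concrete, below is a minimal Python sketch of a Design / Generation / Redesign loop with the Redesign stage split into four sequential steps and a toy self-routing rule over scenario-specific correction agents. Every name here (Plan, genmac_loop, the agent functions, the canned reports) is a hypothetical stand-in for illustration only, not the authors' implementation or any released API.

```python
"""Sketch of a GenMAC-style Design / Generation / Redesign loop.

All agent and generator calls are hypothetical stand-ins (plain Python
functions with canned outputs), not the paper's code or a real MLLM API.
"""
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Plan:
    """Controls redesigned each iteration, per the abstract."""
    prompt: str
    frame_layouts: List[dict] = field(default_factory=list)  # frame-wise layouts
    guidance_scale: float = 7.5


# --- Hypothetical correction agents, one per compositional scenario ----------
def fix_attribute_binding(plan: Plan, suggestion: str) -> Plan:
    plan.prompt += f" [{suggestion}]"        # rewrite prompt to re-bind attributes
    return plan


def fix_temporal_dynamics(plan: Plan, suggestion: str) -> Plan:
    plan.guidance_scale += 1.0               # e.g. strengthen guidance for motion
    return plan


CORRECTION_AGENTS: Dict[str, Callable[[Plan, str], Plan]] = {
    "attribute_binding": fix_attribute_binding,
    "temporal_dynamics": fix_temporal_dynamics,
}


def self_route(verification_report: str) -> str:
    """Toy self-routing: pick the correction agent matching the failure mode."""
    return ("temporal_dynamics" if "motion" in verification_report
            else "attribute_binding")


def redesign(video: str, plan: Plan) -> Plan:
    """Redesign split into four sequential agents; a real system would have
    MLLM agents inspect `video` instead of returning canned strings."""
    report = "motion of the second object drifts"      # 1. verification agent
    suggestion = "keep object 2 inside its layout box"  # 2. suggestion agent
    corrector = CORRECTION_AGENTS[self_route(report)]   # 3. routed correction agent
    plan = corrector(plan, suggestion)
    # 4. output structuring agent: emit a clean, structured plan for generation
    return Plan(plan.prompt, plan.frame_layouts, plan.guidance_scale)


def generate_video(plan: Plan) -> str:
    """Stand-in for the text-to-video backbone."""
    return f"video(prompt={plan.prompt!r}, scale={plan.guidance_scale})"


def genmac_loop(user_prompt: str, max_iters: int = 3) -> str:
    plan = Plan(prompt=user_prompt)          # Design stage
    video = generate_video(plan)             # Generation stage
    for _ in range(max_iters):               # Generation <-> Redesign loop
        plan = redesign(video, plan)
        video = generate_video(plan)
    return video


if __name__ == "__main__":
    print(genmac_loop("a red cube rolls past a blue sphere"))
```

The sequential decomposition of redesign and the routing table mirror the abstract's rationale: splitting verification, suggestion, correction, and output structuring across agents limits the hallucination risk of a single MLLM, while self-routing picks a corrector specialized for the detected failure scenario.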