
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

December 5, 2024
Authors: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu
cs.AI

Abstract

Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent. Multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it aims to verify the generated videos, suggest corrections, and redesign the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid hallucination of a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, achieving state-of-the-art performance in compositional text-to-video generation.
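
The abstract describes an iterative Design → Generation ↔ Redesign workflow, with the Redesign stage split into four sequential agents and a self-routing step that picks a scenario-specialized correction agent. The sketch below is a minimal structural illustration of that loop, assuming hypothetical class and function names and trivially stubbed agent behavior; it is not the authors' implementation, and the actual agents would be backed by MLLM calls and a layout-guided video generator.

```python
"""Structural sketch of a GenMAC-style loop. All names, signatures, and the
stubbed agent logic are illustrative assumptions based only on the abstract."""

from dataclasses import dataclass, field


@dataclass
class GenerationPlan:
    """State carried across iterations: prompt, frame-wise layouts, guidance scale."""
    prompt: str
    frame_layouts: list = field(default_factory=list)  # e.g. per-frame bounding boxes
    guidance_scale: float = 7.5


def design_agent(user_prompt: str) -> GenerationPlan:
    # Design stage: turn the compositional prompt into an initial generation plan.
    return GenerationPlan(prompt=user_prompt)


def generate_video(plan: GenerationPlan) -> str:
    # Placeholder for the layout-guided text-to-video backbone.
    return f"video(prompt={plan.prompt!r}, scale={plan.guidance_scale})"


# --- Redesign stage: four sequentially executed MLLM-based agents (stubbed) ---

def verification_agent(video: str, plan: GenerationPlan) -> dict:
    # Checks the generated video against the prompt and reports mismatches.
    return {"passed": False, "issues": ["attribute binding mismatch"]}


def suggestion_agent(report: dict) -> list:
    # Proposes high-level corrections for each detected issue.
    return [f"fix: {issue}" for issue in report["issues"]]


CORRECTION_AGENTS = {
    # Self-routing pool: one specialized correction agent per scenario (assumed keys).
    "attribute_binding": lambda plan, s: GenerationPlan(
        plan.prompt + " (rebind attributes)", plan.frame_layouts, plan.guidance_scale + 1.0),
    "dynamics": lambda plan, s: plan,
    "interaction": lambda plan, s: plan,
}


def route_correction(suggestions: list) -> str:
    # Self-routing: select the scenario-specialized correction agent
    # (trivial keyword matching here, an MLLM router in the paper's setting).
    return "attribute_binding" if any("binding" in s for s in suggestions) else "dynamics"


def output_structuring_agent(plan: GenerationPlan) -> GenerationPlan:
    # Normalizes the corrected plan into the structured inputs the generator expects.
    return plan


def genmac(user_prompt: str, max_iters: int = 3) -> str:
    plan = design_agent(user_prompt)        # Design
    video = generate_video(plan)            # Generation
    for _ in range(max_iters):              # Generation <-> Redesign loop
        report = verification_agent(video, plan)
        if report["passed"]:
            break
        suggestions = suggestion_agent(report)
        corrector = CORRECTION_AGENTS[route_correction(suggestions)]
        plan = output_structuring_agent(corrector(plan, suggestions))
        video = generate_video(plan)
    return video


if __name__ == "__main__":
    print(genmac("a red cube on top of a blue sphere, then the sphere rolls away"))
```

The key design point reflected here is that each redesign responsibility (verify, suggest, correct, structure) is isolated in its own agent and executed in sequence, so that no single model has to both judge the video and rewrite the plan, which the paper argues reduces hallucination.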
