GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Abstract

Text-to-video generation models have made significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, which involve attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that a complex task can be decomposed into simpler ones, each handled by a role-specialized multimodal large language model (MLLM) agent. Multiple agents can then collaborate to achieve collective intelligence on complex goals. We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow has three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages that progressively verifies and refines the generated videos. Redesign is the most challenging stage: it verifies the generated videos, suggests corrections, and redesigns the text prompts, frame-wise layouts, and guidance scales for the next generation iteration. To avoid the hallucinations of a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to handle the diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism that adaptively selects the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, which achieves state-of-the-art performance in compositional text-to-video generation.
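The control flow described in the abstract (Design once, then a Generation/Redesign loop in which the four Redesign agents run in sequence and a correction agent is picked by self-routing) can be sketched as below. This is only an illustrative sketch, not the authors' implementation: every name here (`Plan`, `Verdict`, `run_loop`, the `scenario` label, the `max_iters` cap, the default guidance scale) is a hypothetical placeholder, and the agents are toy stand-ins for the MLLM calls and the video generator.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Plan:
    """The three generation inputs named in the abstract (shapes are assumed)."""
    prompt: str
    layouts: List[dict] = field(default_factory=list)  # frame-wise layouts
    guidance_scale: float = 7.5  # arbitrary placeholder value


@dataclass
class Verdict:
    aligned: bool   # does the video match the prompt?
    scenario: str   # hypothetical label used as the self-routing key
    notes: str = ""


def run_loop(
    prompt: str,
    design: Callable[[str], Plan],
    generate: Callable[[Plan], str],
    verify: Callable[[str, str], Verdict],
    suggest: Callable[[Verdict], str],
    correctors: Dict[str, Callable[[Plan, str], Plan]],
    structure: Callable[[Plan], Plan],
    max_iters: int = 3,  # assumed cap; the abstract does not give one
) -> Tuple[str, List[str]]:
    plan = design(prompt)                      # Design stage (runs once)
    video, history = "", []
    for _ in range(max_iters):
        video = generate(plan)                 # Generation stage
        verdict = verify(video, prompt)        # Redesign 1: verification agent
        history.append("aligned" if verdict.aligned else verdict.scenario)
        if verdict.aligned:
            break
        suggestion = suggest(verdict)          # Redesign 2: suggestion agent
        correct = correctors[verdict.scenario] # self-routing to a specialist
        revised = correct(plan, suggestion)    # Redesign 3: correction agent
        plan = structure(revised)              # Redesign 4: output structuring
    return video, history


# Toy stand-ins so the sketch runs end to end (no real MLLM or video model).
def toy_verify(video: str, prompt: str) -> Verdict:
    if "[fixed]" in video:
        return Verdict(True, "none")
    return Verdict(False, "attribute", "color binding is wrong")


toy_correctors = {
    "attribute": lambda p, s: Plan(p.prompt + " [fixed]", p.layouts,
                                   p.guidance_scale + 1.0),
}

video, history = run_loop(
    "a red ball next to a blue cube",
    design=lambda text: Plan(prompt=text),
    generate=lambda p: f"video<{p.prompt}|g={p.guidance_scale}>",
    verify=toy_verify,
    suggest=lambda v: "emphasize: " + v.notes,
    correctors=toy_correctors,
    structure=lambda p: p,  # identity; a real agent would normalize the output
)
print(history)
```

In this toy run the first video fails verification, is routed to the single "attribute" corrector, and the second video passes, so the loop stops after two iterations.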

