CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
March 13, 2025
Authors: Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
cs.AI
Abstract
Video generation has witnessed remarkable progress with the advent of deep
generative models, particularly diffusion models. While existing methods excel
in generating high-quality videos from text prompts or single images,
personalized multi-subject video generation remains a largely unexplored
challenge. This task involves synthesizing videos that incorporate multiple
distinct subjects, each defined by separate reference images, while ensuring
temporal and spatial consistency. Current approaches primarily rely on mapping
subject images to keywords in text prompts, which introduces ambiguity and
limits their ability to model subject relationships effectively. In this paper,
we propose CINEMA, a novel framework for coherent multi-subject video
generation by leveraging a Multimodal Large Language Model (MLLM). Our approach
eliminates the need for explicit correspondences between subject images and
text entities, mitigating ambiguity and reducing annotation effort. By using
the MLLM to interpret subject relationships, our method facilitates
scalability, enabling the use of large and diverse datasets for training.
Furthermore, our framework can be conditioned on varying numbers of subjects,
offering greater flexibility in personalized content creation. Through
extensive evaluations, we demonstrate that our approach significantly improves
subject consistency and overall video coherence, paving the way for advanced
applications in storytelling, interactive media, and personalized video
generation.
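To illustrate the kind of conditioning the abstract describes, below is a minimal PyTorch sketch, not the paper's implementation: the module name `MultiSubjectConditioner`, all dimensions, and the transformer-based fusion design are assumptions. It shows how a text prompt and a variable number of subject reference images could be fused into a single conditioning sequence for a video diffusion backbone, without mapping each image to a keyword in the prompt.

```python
import torch
import torch.nn as nn

class MultiSubjectConditioner(nn.Module):
    """Hypothetical MLLM-style fusion module: jointly encodes a text prompt
    and a variable number of subject reference images into one conditioning
    sequence, so subject and prompt tokens can attend to each other.
    All dimensions and the architecture are illustrative assumptions."""

    def __init__(self, feat_dim: int = 768, dim: int = 512, num_layers: int = 4):
        super().__init__()
        self.text_proj = nn.Linear(feat_dim, dim)    # prompt token features -> shared space
        self.image_proj = nn.Linear(feat_dim, dim)   # subject image features -> shared space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_tokens: torch.Tensor, subject_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, T, feat_dim); subject_feats: (B, S, feat_dim), S may vary.
        seq = torch.cat([self.text_proj(text_tokens),
                         self.image_proj(subject_feats)], dim=1)
        # Self-attention models subject-subject and subject-prompt relationships
        # directly, instead of tying each image to a keyword in the prompt.
        return self.fusion(seq)  # (B, T + S, dim), used as cross-attention context

# Usage: feed the fused sequence to a video diffusion denoiser as its context.
conditioner = MultiSubjectConditioner()
text = torch.randn(1, 20, 768)       # e.g. frozen text-encoder token features
subjects = torch.randn(1, 3, 768)    # pooled features of three reference images
context = conditioner(text, subjects)
print(context.shape)                 # torch.Size([1, 23, 512])
```

Concatenation followed by self-attention keeps the interface agnostic to the number of subjects, which matches the flexibility in subject count that the abstract claims.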