CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

March 13, 2025
Authors: Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
cs.AI

Abstract

Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel at generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by a separate reference image, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation that leverages a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging the MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.
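The abstract does not spell out the conditioning mechanism, but the general idea of MLLM-based guidance can be illustrated with a minimal, hypothetical sketch: per-subject image embeddings and prompt-token embeddings (as an MLLM might produce) are projected into a shared space and fused into a single context sequence that a video diffusion backbone could attend to via cross-attention. All module names, dimensions, and the fusion layer below are illustrative assumptions, not CINEMA's actual implementation.

```python
import torch
import torch.nn as nn


class MultiSubjectConditioner(nn.Module):
    """Hypothetical sketch: fuse a variable number of subject-image
    embeddings with prompt-token embeddings into one conditioning
    sequence for a video diffusion backbone. Names and dimensions are
    illustrative, not CINEMA's actual design."""

    def __init__(self, img_dim=1024, txt_dim=768, cond_dim=1024, n_heads=8):
        super().__init__()
        # Project image and text features into a shared conditioning space.
        self.img_proj = nn.Linear(img_dim, cond_dim)
        self.txt_proj = nn.Linear(txt_dim, cond_dim)
        # Self-attention over the concatenated tokens lets subject tokens and
        # prompt tokens interact (a stand-in for MLLM relation reasoning).
        self.fuse = nn.MultiheadAttention(cond_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(cond_dim)

    def forward(self, subject_feats, prompt_feats):
        # subject_feats: (B, N_subjects, img_dim), one embedding per reference image
        # prompt_feats:  (B, N_tokens, txt_dim), prompt token embeddings
        tokens = torch.cat(
            [self.img_proj(subject_feats), self.txt_proj(prompt_feats)], dim=1
        )
        fused, _ = self.fuse(tokens, tokens, tokens)
        # The returned sequence would serve as cross-attention context for the
        # video diffusion transformer / U-Net.
        return self.norm(fused + tokens)


if __name__ == "__main__":
    B, n_subjects, n_tokens = 2, 3, 16
    conditioner = MultiSubjectConditioner()
    subj = torch.randn(B, n_subjects, 1024)  # e.g. vision features per subject
    txt = torch.randn(B, n_tokens, 768)      # e.g. prompt embeddings
    context = conditioner(subj, txt)
    print(context.shape)  # torch.Size([2, 19, 1024])
```

Because the context is a variable-length token sequence, a module of this kind accepts any number of subject reference images, which mirrors the paper's claim of conditioning on varying numbers of subjects.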

