VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
December 3, 2024
Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
cs.AI
Abstract
Current video generation models excel at generating short clips but still
struggle with creating multi-shot, movie-like videos. Existing models, although
trained on large-scale data with substantial computational resources, fail to
maintain a logical storyline and visual consistency across the multiple shots
of a cohesive script, because they are typically trained with a single-shot
objective. To this end, we propose
VideoGen-of-Thought (VGoT), a collaborative and training-free architecture
designed specifically for multi-shot video generation. VGoT is designed with
three goals in mind. Multi-Shot Video Generation: We divide the
video generation process into a structured, modular sequence of four stages: (1)
Script Generation, which translates a brief story into detailed prompts for each
shot; (2) Keyframe Generation, which creates visually consistent
keyframes faithful to character portrayals; (3) Shot-Level Video
Generation, which transforms information from scripts and keyframes into shots;
and (4) a Smoothing Mechanism, which ensures a consistent multi-shot output. Reasonable
Narrative Design: Inspired by cinematic scriptwriting, our prompt generation
approach spans five key domains, ensuring logical consistency, character
development, and narrative flow across the entire video. Cross-Shot
Consistency: We ensure temporal and identity consistency by leveraging
identity-preserving (IP) embeddings across shots, which are automatically
created from the narrative. Additionally, we incorporate a cross-shot smoothing
mechanism, which integrates a reset boundary that effectively combines latent
features from adjacent shots, resulting in smooth transitions and maintaining
visual coherence throughout the video. Our experiments demonstrate that VGoT
surpasses existing video generation methods in producing high-quality,
coherent, multi-shot videos.
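
The abstract says the cross-shot smoothing mechanism combines latent features from adjacent shots, but it does not give the rule. As a rough illustration of the general idea only, the sketch below joins two shots by linearly cross-fading a few latent frames around the cut. The linear weighting, the `overlap` parameter, the function name `crossfade_boundary`, and the list-of-floats latent representation are all assumptions made for this example; they are not taken from VGoT.

```python
from typing import List

Latent = List[float]  # one flattened latent frame (hypothetical representation)
Shot = List[Latent]   # a shot is an ordered sequence of latent frames


def crossfade_boundary(prev_shot: Shot, next_shot: Shot, overlap: int) -> Shot:
    """Join two shots, linearly blending `overlap` frames around the cut.

    Assumption: this linear cross-fade is an illustrative stand-in and is
    not the smoothing rule used by VGoT.
    """
    if overlap < 0 or overlap > min(len(prev_shot), len(next_shot)):
        raise ValueError("overlap must fit inside both shots")
    if overlap == 0:
        # Nothing to blend: plain concatenation (a hard cut).
        return prev_shot + next_shot

    head = prev_shot[:-overlap]  # frames of the first shot left untouched
    tail = next_shot[overlap:]   # frames of the second shot left untouched
    blended: Shot = []
    for i in range(overlap):
        # Weight on next_shot grows across the overlap window.
        w = (i + 1) / (overlap + 1)
        a = prev_shot[len(prev_shot) - overlap + i]
        b = next_shot[i]
        if len(a) != len(b):
            raise ValueError("latent frames must have the same size")
        blended.append([(1 - w) * x + w * y for x, y in zip(a, b)])
    return head + blended + tail


if __name__ == "__main__":
    shot_a = [[0.0], [0.0], [0.0]]
    shot_b = [[1.0], [1.0], [1.0]]
    print(crossfade_boundary(shot_a, shot_b, overlap=1))
```

With `overlap` frames blended, the joined sequence is `overlap` frames shorter than a plain concatenation, so a real pipeline would need to account for that when timing shots.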