VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
December 3, 2024
Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim
cs.AI
Abstract
Current video generation models excel at generating short clips but still
struggle with creating multi-shot, movie-like videos. Existing models, trained
on large-scale data with ample computational resources, still fail to
maintain a logical storyline and visual consistency across the shots of a
cohesive script, because they are typically trained with a single-shot
objective. To this end, we propose
VideoGen-of-Thought (VGoT), a collaborative and training-free architecture
designed specifically for multi-shot video generation. VGoT is built around
three goals. Multi-Shot Video Generation: We divide the video generation
process into a structured, modular sequence of four stages: (1)
Script Generation, which translates a brief story into detailed prompts for each
shot; (2) Keyframe Generation, responsible for creating visually consistent
keyframes faithful to character portrayals; (3) Shot-Level Video
Generation, which transforms information from scripts and keyframes into shots; and
(4) a Smoothing Mechanism that ensures a consistent multi-shot output. Reasonable
Narrative Design: Inspired by cinematic scriptwriting, our prompt generation
approach spans five key domains, ensuring logical consistency, character
development, and narrative flow across the entire video. Cross-Shot
Consistency: We ensure temporal and identity consistency by leveraging
identity-preserving (IP) embeddings across shots, which are automatically
created from the narrative. Additionally, we incorporate a cross-shot smoothing
mechanism, which integrates a reset boundary that effectively combines latent
features from adjacent shots, resulting in smooth transitions and maintaining
visual coherence throughout the video. Our experiments demonstrate that VGoT
surpasses existing video generation methods in producing high-quality,
coherent, multi-shot videos.
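
The four-stage structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every function name here (`generate_script`, `generate_keyframe`, `generate_shot_latents`, `smooth_boundary`, `generate_video`) is hypothetical, the model calls are replaced by trivial stand-ins, and each frame's latent is reduced to a single number. The abstract does not specify how the reset boundary combines latent features, so the sketch substitutes a plain linear blend at each shot boundary as an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shot:
    prompt: str            # per-shot prompt produced by script generation
    keyframe: str          # placeholder standing in for a keyframe image
    latents: List[float]   # placeholder: one scalar "latent" per frame


def generate_script(story: str, num_shots: int) -> List[str]:
    # Hypothetical stand-in for a language-model call that expands a story
    # into one detailed prompt per shot.
    return [f"Shot {i + 1} of {num_shots}: {story}" for i in range(num_shots)]


def generate_keyframe(prompt: str, identity: str) -> str:
    # Hypothetical stand-in for identity-conditioned keyframe generation.
    return f"keyframe<{identity}>({prompt})"


def generate_shot_latents(shot_index: int, num_frames: int) -> List[float]:
    # Hypothetical stand-in for a video model. Each shot gets a constant
    # value so the effect of smoothing is easy to see.
    return [float(shot_index)] * num_frames


def smooth_boundary(prev: List[float], nxt: List[float], width: int) -> List[float]:
    # Assumed blend (not the paper's reset boundary): pull the last `width`
    # frames of `prev` linearly toward the first frame of `nxt`.
    out = list(prev)
    if not nxt:
        return out
    width = min(width, len(prev))
    start = len(prev) - width
    for k in range(width):
        w = (k + 1) / (width + 1)  # blend weight grows toward the boundary
        out[start + k] = (1 - w) * prev[start + k] + w * nxt[0]
    return out


def generate_video(story: str, identity: str, num_shots: int,
                   num_frames: int, width: int) -> List[float]:
    # Stage 1: script generation.
    prompts = generate_script(story, num_shots)
    # Stages 2 and 3: keyframe generation, then shot-level generation.
    shots = [
        Shot(p, generate_keyframe(p, identity), generate_shot_latents(i, num_frames))
        for i, p in enumerate(prompts)
    ]
    # Stage 4: smooth each shot into its successor.
    for i in range(len(shots) - 1):
        shots[i].latents = smooth_boundary(shots[i].latents, shots[i + 1].latents, width)
    # Concatenate all shots into one sequence of frame latents.
    return [x for s in shots for x in s.latents]
```

Calling `generate_video("a day in the life", "hero", 2, 4, 2)` returns eight frame values in which the last two frames of the first shot move from 0 toward 1, the value of the second shot. In a real system each stage would wrap a separate pretrained model, which is consistent with the training-free, collaborative design the abstract describes.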