Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
December 2, 2024
Authors: Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang
cs.AI
Abstract
We introduce Presto, a novel video diffusion model designed to generate
15-second videos with long-range coherence and rich content. Extending video
generation methods to maintain scenario diversity over long durations presents
significant challenges. To address this, we propose a Segmented Cross-Attention
(SCA) strategy, which splits hidden states into segments along the temporal
dimension, allowing each segment to cross-attend to a corresponding
sub-caption. SCA requires no additional parameters, enabling seamless
incorporation into current DiT-based architectures. To facilitate high-quality
long video generation, we build the LongTake-HD dataset, consisting of 261k
content-rich videos with scenario coherence, annotated with an overall video
caption and five progressive sub-captions. Experiments show that Presto
achieves 78.5% on the VBench Semantic Score and 100% on Dynamic Degree,
outperforming existing state-of-the-art video generation methods. These
results indicate that Presto significantly enhances content richness,
maintains long-range coherence, and captures intricate textual details. More
details are available on our project page: https://presto-video.github.io/.
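
The SCA idea described in the abstract (split the hidden states along the temporal dimension and let each segment cross-attend to its own sub-caption) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tensor layout `(frames, tokens_per_frame, dim)`, the equal-sized non-overlapping segments, and the identity query/key/value projections are all assumptions made for illustration. In a DiT block, the existing cross-attention projections would presumably be reused, which is consistent with the claim that SCA adds no parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    # q: (n_query, dim); k, v: (n_text, dim).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def segmented_cross_attention(hidden, sub_captions):
    """Hypothetical sketch of segmented cross-attention.

    hidden:       (T, N, dim) video latents, T frames with N tokens each.
    sub_captions: list of S arrays, each (L_s, dim) text embeddings.
    Frames are split into S contiguous segments (an assumption); segment s
    attends only to sub-caption s.
    """
    T, N, dim = hidden.shape
    segments = np.array_split(np.arange(T), len(sub_captions))
    out = np.empty_like(hidden)
    for frames, text in zip(segments, sub_captions):
        q = hidden[frames].reshape(-1, dim)       # tokens of this segment
        attended = cross_attention(q, text, text)  # identity K/V projections
        out[frames] = attended.reshape(len(frames), N, dim)
    return out


# Usage: 10 latent frames, 4 tokens per frame, and five sub-captions,
# matching the five progressive sub-captions per video in LongTake-HD.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 4, 8))
sub_captions = [rng.normal(size=(6, 8)) for _ in range(5)]
out = segmented_cross_attention(hidden, sub_captions)
```

In this sketch, changing one sub-caption affects only the frames in its segment, which is the property that lets the text condition progress over the course of the video.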