RepVideo: Rethinking Cross-Layer Representation for Video Generation
January 15, 2025
Authors: Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
cs.AI
Abstract
Video generation has achieved remarkable progress with the introduction of
diffusion models, which have significantly improved the quality of generated
videos. However, recent research has primarily focused on scaling up model
training, while offering limited insights into the direct impact of
representations on the video generation process. In this paper, we initially
investigate the characteristics of features in intermediate layers, finding
substantial variations in attention maps across different layers. These
variations lead to unstable semantic representations and contribute to
cumulative differences between features, which ultimately reduce the similarity
between adjacent frames and negatively affect temporal coherence. To address
this, we propose RepVideo, an enhanced representation framework for
text-to-video diffusion models. By accumulating features from neighboring
layers to form enriched representations, this approach captures more stable
semantic information. These enhanced representations are then used as inputs to
the attention mechanism, thereby improving semantic expressiveness while
ensuring feature consistency across adjacent frames. Extensive experiments
demonstrate that our RepVideo not only significantly enhances the ability to
generate accurate spatial appearances, such as capturing complex spatial
relationships between multiple objects, but also improves temporal consistency
in video generation.
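To make the mechanism described above concrete, below is a minimal sketch of the cross-layer feature-accumulation idea: hidden states from a window of neighboring transformer layers are averaged into an enriched representation, which is then fed to the attention module. All names (`AccumulatedAttentionBlock`, `window`, `history`) and the aggregation rule (a simple mean over the last few layers) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of accumulating features from neighboring layers
# and using the enriched representation as the attention input.
import torch
import torch.nn as nn


class AccumulatedAttentionBlock(nn.Module):
    """Transformer block whose attention input averages the hidden states
    of the last `window` layers, rather than using only the immediately
    preceding layer's output."""

    def __init__(self, dim: int, num_heads: int, window: int = 4):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, history: list) -> torch.Tensor:
        # Accumulate features from neighboring (previous) layers to form
        # a more stable semantic representation.
        history.append(x)
        stacked = torch.stack(history[-self.window:], dim=0)
        enriched = self.norm1(stacked.mean(dim=0))
        # The enriched representation serves as the attention input.
        attn_out, _ = self.attn(enriched, enriched, enriched)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x


# Usage: tokens from a video latent, shape (batch, sequence, dim).
tokens = torch.randn(2, 1024, 512)
history: list = []
for block in [AccumulatedAttentionBlock(512, 8) for _ in range(6)]:
    tokens = block(tokens, history)
print(tokens.shape)  # torch.Size([2, 1024, 512])
```

Because the averaged input changes slowly from layer to layer, the attention maps it produces vary less across depth, which is the property the abstract links to more stable semantics and better frame-to-frame consistency.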