RepVideo: Rethinking Cross-Layer Representation for Video Generation
January 15, 2025
Authors: Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
cs.AI
Abstract
Video generation has achieved remarkable progress with the introduction of
diffusion models, which have significantly improved the quality of generated
videos. However, recent research has primarily focused on scaling up model
training, while offering limited insights into the direct impact of
representations on the video generation process. In this paper, we initially
investigate the characteristics of features in intermediate layers, finding
substantial variations in attention maps across different layers. These
variations lead to unstable semantic representations and contribute to
cumulative differences between features, which ultimately reduce the similarity
between adjacent frames and negatively affect temporal coherence. To address
this, we propose RepVideo, an enhanced representation framework for
text-to-video diffusion models. By accumulating features from neighboring
layers to form enriched representations, this approach captures more stable
semantic information. These enhanced representations are then used as inputs to
the attention mechanism, thereby improving semantic expressiveness while
ensuring feature consistency across adjacent frames. Extensive experiments
demonstrate that our RepVideo not only significantly enhances the ability to
generate accurate spatial appearances, such as capturing complex spatial
relationships between multiple objects, but also improves temporal consistency
in video generation.
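The abstract does not spell out the exact aggregation mechanism. As a minimal sketch of the stated idea (accumulating features from neighboring layers into an enriched representation that is then fed to attention), the following PyTorch snippet shows one plausible reading; the class name `CrossLayerAggregator`, the simple mean over layers, the learned projection, and the toy tensor shapes are all illustrative assumptions, not the authors' verified implementation.

```python
# Hedged sketch of cross-layer feature aggregation in the spirit of RepVideo.
# The mean-pooling choice and all names/shapes here are assumptions for
# illustration, not the paper's actual architecture.
import torch
import torch.nn as nn


class CrossLayerAggregator(nn.Module):
    """Forms an enriched representation by averaging the hidden states of
    the current layer and its preceding layers, then projecting the
    aggregate back into the model dimension before the attention block."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: hidden states from neighboring layers, each of
        # shape (batch, tokens, hidden_dim); the last entry is the
        # current layer's input.
        aggregate = torch.stack(layer_feats, dim=0).mean(dim=0)
        current = layer_feats[-1]
        # Enriched input for attention: current features plus a learned
        # projection of the cross-layer aggregate, intended to give more
        # stable semantics across layers (and hence across frames).
        return current + self.proj(aggregate)


# Toy usage with DiT-like shapes: batch=2, 16 frames x 256 tokens, dim=1152.
feats = [torch.randn(2, 16 * 256, 1152) for _ in range(4)]
enriched = CrossLayerAggregator(hidden_dim=1152)(feats)
# `enriched` would then replace the raw hidden state as the input to the
# layer's self-attention module.
```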