VideoRoPE: What Makes for Good Video Rotary Position Embedding?
February 7, 2025
Authors: Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
cs.AI
Abstract
While Rotary Position Embedding (RoPE) and its variants are widely adopted
for their long-context capabilities, the extension of the 1D RoPE to video,
with its complex spatio-temporal structure, remains an open challenge. This
work first introduces a comprehensive analysis that identifies four key
characteristics essential for the effective adaptation of RoPE to video, which
have not been fully considered in prior work. As part of our analysis, we
introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors)
task, which adds periodic distractors into V-NIAH. The V-NIAH-D task
demonstrates that previous RoPE variants, lacking appropriate temporal
dimension allocation, are easily misled by distractors. Based on our analysis,
we introduce VideoRoPE, with a 3D structure designed to
preserve spatio-temporal relationships. VideoRoPE features
low-frequency temporal allocation to mitigate periodic oscillations, a
diagonal layout to maintain spatial symmetry, and adjustable
temporal spacing to decouple temporal and spatial indexing. VideoRoPE
consistently surpasses previous RoPE variants across diverse downstream tasks
such as long video retrieval, video understanding, and video hallucination. Our
code will be available at
https://github.com/Wiselnn570/VideoRoPE.
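
The three design choices named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the centering formula used for the diagonal layout, the number of temporal frequency pairs, and the default spacing `delta` are all assumptions made for illustration. The sketch gives the lowest-frequency rotary pairs to the temporal axis, centers the spatial coordinates of each frame on its temporal index, and scales the frame index by an adjustable spacing.

```python
def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE frequencies, one per pair of channels (decreasing in i)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def allocate_axes(num_pairs, num_temporal):
    """Label each frequency pair with the axis whose index rotates it.

    Assumption: the last (lowest-frequency) pairs go to time 't', and the
    remaining pairs alternate between the spatial axes 'x' and 'y'.
    """
    num_spatial = num_pairs - num_temporal
    axes = ["x" if i % 2 == 0 else "y" for i in range(num_spatial)]
    return axes + ["t"] * num_temporal


def video_position(t, h, w, height, width, delta=2.0):
    """Hypothetical 3D index for the patch at row h, column w of frame t.

    delta is the adjustable temporal spacing. The spatial coordinates are
    centered on the scaled frame index, so the middle of every frame lies on
    the diagonal t == x == y (an assumed reading of "diagonal layout").
    """
    center = delta * t
    return {
        "t": center,
        "x": center + w - (width - 1) / 2.0,
        "y": center + h - (height - 1) / 2.0,
    }


def rotation_angles(pos, head_dim, num_temporal, base=10000.0):
    """Rotation angle of each channel pair: axis index times pair frequency."""
    freqs = rope_frequencies(head_dim, base)
    axes = allocate_axes(len(freqs), num_temporal)
    return [pos[axis] * freq for axis, freq in zip(axes, freqs)]
```

For example, `video_position(3, 1, 1, 3, 3)` is the center patch of the fourth frame in a 3x3 grid, and with the default spacing all three of its coordinates equal 6.0. The corner patches of a frame sit symmetrically around that center.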