VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
December 2, 2024
Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
cs.AI
Abstract
Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video
analysis. Recent joint prediction transformer models often overlook their
cross-task dynamics and video-text alignment and refinement. Moreover, most
models typically use limited, uni-directional attention mechanisms, resulting
in weakly integrated representations and suboptimal performance in capturing
the interdependence between video and text modalities. Although large language
models and large vision-language models (LLMs/LVLMs) have gained prominence across
various domains, their application in this field remains relatively underexplored. Here
we propose VideoLights, a novel HD/MR framework addressing these limitations
through (i) Convolutional Projection and Feature Refinement modules with an
alignment loss for better video-text feature alignment, (ii) a Bi-Directional
Cross-Modal Fusion network for strongly coupled query-aware clip
representations, and (iii) a Uni-directional joint-task feedback mechanism
that enhances both tasks through correlation. In addition, (iv) we introduce hard
positive/negative losses for adaptive error penalization and improved learning,
and (v) leverage LVLMs such as BLIP-2 for enhanced multimodal feature integration
and intelligent pretraining using synthetic data generated from LVLMs.
Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks
demonstrate state-of-the-art performance. Code and models are available at
https://github.com/dpaul06/VideoLights.
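
To make item (ii) concrete, the following is a minimal sketch of what bi-directional cross-modal attention can look like in general. It is not the authors' implementation: the function names (cross_attention, bidirectional_fusion), the single-head attention without learned projections, the residual additions, and the toy 2-dimensional features are all assumptions chosen for illustration. Text tokens first attend to video clips, and the clips then attend to the resulting video-aware text, which yields query-aware clip representations.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, context):
    # Scaled dot-product attention without learned projections (an assumption):
    # each query becomes a softmax-weighted average of the context vectors.
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(len(q))])
    return out

def add(a, b):
    # Element-wise sum of two equally shaped matrices (residual connection).
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def bidirectional_fusion(clips, tokens):
    # Direction 1 (text -> video): each token gathers evidence from the clips.
    video_aware_text = add(tokens, cross_attention(tokens, clips))
    # Direction 2 (video -> text): each clip attends to the video-aware text.
    return add(clips, cross_attention(clips, video_aware_text))

# Hypothetical toy features: 3 video clips and 2 query tokens, dimension 2.
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [[1.0, 0.0], [0.5, 0.5]]
fused = bidirectional_fusion(clips, tokens)
print(len(fused), len(fused[0]))  # prints "3 2": one fused vector per clip
```

The output keeps one vector per clip, so downstream highlight scoring or moment localization heads could consume it in place of the raw clip features. A uni-directional variant would apply only the second step, with clips attending to the unmodified tokens.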