VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

December 2, 2024
Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
cs.AI

Abstract
Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) a uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Code and models are available at https://github.com/dpaul06/VideoLights .
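Component (ii), the bi-directional cross-modal fusion, couples the modalities by letting video clips attend over text tokens and text tokens attend over video clips, rather than attending in only one direction. The following is a minimal, dependency-free sketch of that idea using toy scaled dot-product attention; the feature values are illustrative, and the actual model uses learned projections, multi-head attention, and the paper's specific fusion design:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys, values):
    """Each query attends over all keys and returns a weighted mix of values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        out.append(mixed)
    return out

# Toy features: 3 video clips and 2 text tokens, 4-dim each (made-up values).
video = [[0.2, 0.1, 0.0, 0.5], [0.9, 0.3, 0.4, 0.1], [0.0, 0.8, 0.2, 0.3]]
text  = [[0.5, 0.1, 0.3, 0.2], [0.1, 0.7, 0.0, 0.4]]

# Bi-directional fusion: video queries text, and text queries video.
video_aware_of_text = cross_attend(video, text, text)
text_aware_of_video = cross_attend(text, video, video)
```

Each fused vector is a convex combination of the other modality's features, so every clip representation becomes query-aware and vice versa; in the full model these fused streams would be further projected and combined for the HD and MR heads.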
