SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

November 25, 2024
Authors: Junho Kim, Hyunjun Kim, Hosu Lee, Yong Man Ro
cs.AI

Abstract

Despite advances in large multimodal models (LMMs), applying them to long, untrimmed video content remains challenging because of limited context length and substantial memory overhead. These constraints often lead to significant information loss and less relevant model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level so that models can capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs that integrate a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by precisely identifying and retrieving the video segments relevant to a query, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos, including a marked ability to maintain contextual integrity across extended sequences.
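
To make the idea of query-based segment retrieval concrete, the following is a minimal sketch, not SALOVA's actual routing mechanism, which the abstract does not specify. It assumes each video segment and the user query have already been embedded as vectors, and it uses plain cosine similarity as a stand-in for the relevance score. The function names `cosine` and `retrieve_segments` and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_segments(query_vec, segment_vecs, k=2):
    """Return the indices of the k segments most similar to the query.

    The selected indices are returned in temporal (ascending) order so that
    a downstream model would see the retrieved segments in the order they
    occur in the video.
    """
    ranked = sorted(
        range(len(segment_vecs)),
        key=lambda i: cosine(query_vec, segment_vecs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy, hand-made embeddings (hypothetical): one vector per video segment.
segments = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

# Only the selected segments would be passed on to the language model.
selected = retrieve_segments(query, segments, k=2)
```

Passing only the top-k segments to the language model, rather than every frame of the video, is what keeps the input within a fixed context and memory budget in this kind of design.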
