SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis

Abstract

Despite advances in Large Multi-modal Models, applying them to long, untrimmed video content remains challenging because of limited context length and substantial memory overhead. These constraints often lead to significant information loss and reduce the relevance of model responses. With the exponential growth of video data across web platforms, understanding long-form video is crucial for advancing generalized intelligence. In this paper, we introduce SALOVA: Segment-Augmented LOng Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content through a targeted retrieval process. We address two main challenges to achieve this: (i) We present the SceneWalk dataset, a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich descriptive context. (ii) We develop robust architectural designs that integrate a dynamic routing mechanism and a spatio-temporal projector to efficiently retrieve and process relevant video segments based on user queries. Our framework mitigates the limitations of current video-LMMs by allowing precise identification and retrieval of relevant video segments in response to queries, thereby improving the contextual relevance of the generated responses. Through extensive experiments, SALOVA demonstrates enhanced capability in processing complex long-form videos and a strong ability to maintain contextual integrity across extended sequences.
