VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Abstract

Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model, marrying the language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Code is released at https://github.com/showlab/VLog.
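To make the idea of indexing events through a hierarchical narration vocabulary more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy two-level vocabulary (scenario, then narration) and a bag-of-words embedding standing in for learned encoders; in VLog the query would come from the language model rather than from text. The names `VOCAB`, `embed`, `cosine`, and `retrieve` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary: scenario prefix -> event narrations.
VOCAB = {
    "kitchen": ["cutting a tomato", "washing a pan"],
    "bedroom": ["turning off an alarm", "making the bed"],
}


def embed(text):
    """Toy bag-of-words embedding (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, vocab=VOCAB):
    """Two-stage lookup: pick a scenario first, then search only its narrations."""
    q = embed(query)
    # Stage 1: score each scenario by its name pooled with its narrations.
    scenario = max(
        vocab,
        key=lambda s: cosine(q, embed(s + " " + " ".join(vocab[s]))),
    )
    # Stage 2: similarity search restricted to the chosen scenario.
    narration = max(vocab[scenario], key=lambda n: cosine(q, embed(n)))
    return scenario, narration


if __name__ == "__main__":
    print(retrieve("kitchen cutting tomato"))
```

Restricting the second stage to one scenario is what keeps the search small as the vocabulary grows; how VLog actually builds and scores its hierarchy is described in the paper and repository, not here.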
