VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
March 12, 2025
Authors: Kevin Qinghong Lin, Mike Zheng Shou
cs.AI
Abstract
Human daily activities can be concisely narrated as sequences of routine
events (e.g., turning off an alarm) in video streams, forming an event
vocabulary. Motivated by this, we introduce VLog, a novel video understanding
framework that defines video narrations as vocabulary, going beyond the typical
subword vocabularies in existing generative video-language models. Built on the
lightweight language model GPT-2, VLog features three key innovations: (i) A
generative retrieval model, marrying a language model's complex reasoning
capabilities with contrastive retrieval's efficient similarity search. (ii) A
hierarchical vocabulary derived from large-scale video narrations using our
narration pair encoding algorithm, enabling efficient indexing of specific
events (e.g., cutting a tomato) by identifying broader scenarios (e.g.,
kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary
update strategy leveraging generative models to extend the vocabulary for novel
events encountered during inference. To validate our approach, we introduce
VidCap-Eval, a development set requiring concise narrations with reasoning
relationships (e.g., before and after). Experiments on EgoSchema, COIN, and
HiREST further demonstrate the effectiveness of VLog, highlighting its ability
to generate concise, contextually accurate, and efficient narrations, offering
a novel perspective on video understanding. Code is released at
https://github.com/showlab/VLog.
AI-Generated Summary
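Paper Overview
Core Contributions
- Proposes VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the subword vocabularies of existing generative video-language models.
- Introduces a generative retrieval model that combines a language model's complex reasoning capabilities with the efficient similarity search of contrastive retrieval.
- Proposes a hierarchical vocabulary built from large-scale video narrations, with a narration pair encoding algorithm enabling efficient indexing.
- Designs a vocabulary update strategy that uses generative models to extend the vocabulary with novel events encountered during inference.
Research Context
- Existing video-language models rely mainly on subword vocabularies, which offer limited visual interpretability and slow inference.
- Human daily activities can be concisely narrated as a sequence of routine events (e.g., turning off an alarm), forming an event vocabulary.
Keywords
- Video understanding
- Generative retrieval
- Narration vocabulary
- Hierarchical indexing
- Vocabulary update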
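Background
Research Gap
- Existing video-language models are slow at inference on long videos, and their subword vocabularies lack visual interpretability.
Technical Challenges
- How to support complex reasoning tasks while keeping retrieval efficient.
- How to build a narration vocabulary that covers a wide range of events and is easy to index.
Prior Approaches
- Generative models: capable of complex reasoning, but slow at inference.
- Retrieval models: fast at retrieval, but lacking complex reasoning capabilities.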
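Methodology
Technical Architecture
- Builds a generative retrieval architecture on the lightweight language model GPT-2, combined with the contrastive vision-text model SigLIP.
- Introduces a retrieval token that embeds visual and query information into the reasoning process.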
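As a rough illustration of the retrieval step, the sketch below scores a single query vector against a small narration vocabulary by cosine similarity and returns the closest narrations. The vector stands in for the hidden state at the retrieval token. The function names, the three-dimensional embeddings, and the sample narrations are all hypothetical and are not taken from the VLog code, where GPT-2 and SigLIP would produce the actual representations.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_narration(retrieval_state, vocab_embeddings, top_k=1):
    """Rank narrations by similarity to the retrieval-token state.

    retrieval_state: hypothetical stand-in for the language model's hidden
        state at the retrieval token.
    vocab_embeddings: dict mapping each narration string to its embedding.
    """
    scored = [
        (cosine(retrieval_state, emb), text)
        for text, emb in vocab_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy narration vocabulary with made-up embeddings (assumption, for illustration only).
vocab = {
    "turn off the alarm": [0.9, 0.1, 0.0],
    "cut a tomato": [0.1, 0.9, 0.2],
    "wash the dishes": [0.0, 0.2, 0.9],
}
state = [0.2, 0.8, 0.1]

print(retrieve_narration(state, vocab))
print(retrieve_narration(state, vocab, top_k=2))
```

The point of the design is that the output is chosen from whole narrations in one similarity search, rather than decoded subword by subword.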
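Implementation Details
- Uses the narration pair encoding algorithm to generate prefix and postfix sets from existing video narration datasets.
- Adopts a hierarchical indexing strategy: scenario recognition quickly narrows the search space, and postfixes then refine the result.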
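The hierarchical lookup can be sketched as a staged search: pick a scenario first, search only the events filed under that scenario, then attach a postfix. This is a minimal sketch under stated assumptions, not the paper's implementation: the data layout, the dot-product scoring, the function names, and the toy embeddings are all hypothetical, and the narration pair encoding step that would produce the prefix and postfix sets is not shown.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best(query, candidates):
    """Return the key whose embedding scores highest against the query."""
    return max(candidates, key=lambda key: dot(query, candidates[key]))


def hierarchical_lookup(query, scenarios, events_by_scenario, postfixes):
    """Staged search: scenario, then an event within it, then a postfix."""
    scenario = best(query, scenarios)                  # narrow the search space
    event = best(query, events_by_scenario[scenario])  # search only that scenario
    postfix = best(query, postfixes)                   # refine with a postfix
    return scenario, f"{event} {postfix}"


# Toy hierarchical vocabulary with made-up embeddings (assumption).
scenarios = {
    "kitchen": [1.0, 0.0, 0.0, 0.0],
    "bedroom": [0.0, 1.0, 0.0, 0.0],
}
events_by_scenario = {
    "kitchen": {
        "cut a tomato": [1.0, 0.0, 1.0, 0.0],
        "wash the dishes": [1.0, 0.0, -1.0, 0.0],
    },
    "bedroom": {
        "turn off the alarm": [0.0, 1.0, 1.0, 0.0],
        "make the bed": [0.0, 1.0, -1.0, 0.0],
    },
}
postfixes = {
    "by the left hand": [0.0, 0.0, 0.0, 1.0],
    "by the right hand": [0.0, 0.0, 0.0, -1.0],
}

query = [0.8, 0.1, 0.5, 0.3]
print(hierarchical_lookup(query, scenarios, events_by_scenario, postfixes))
```

In this toy setup the staged search makes six comparisons (two scenarios, two events, two postfixes) instead of the eight needed to score every event-postfix combination; the gap widens as the vocabulary grows.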
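Innovations
- Generative retrieval model: combines the strengths of generative and retrieval models.
- Hierarchical vocabulary construction: efficient indexing through scenario recognition and narration pair encoding.
- Vocabulary update strategy: uses generative models to extend the vocabulary with novel events.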
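One way to picture the vocabulary update strategy is as a fallback: if no existing narration matches the query well enough, a generative model proposes a new narration, which is then added to the vocabulary. The sketch below is an assumption-heavy illustration. The similarity threshold as the trigger, the `generate_fn` and `embed_fn` callables, and the stub models are hypothetical; the summary does not state how VLog decides that an event is novel.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def narrate_with_update(query, vocab, generate_fn, embed_fn, threshold=0.5):
    """Retrieve a narration, or generate one and add it to the vocabulary.

    Returns (narration, was_added). The threshold trigger is an assumption,
    not a detail confirmed by the source.
    """
    if vocab:
        best_text = max(vocab, key=lambda text: dot(query, vocab[text]))
        if dot(query, vocab[best_text]) >= threshold:
            return best_text, False
    # No confident match: fall back to the (hypothetical) generative model.
    new_text = generate_fn(query)
    vocab[new_text] = embed_fn(new_text)
    return new_text, True


# Stub models standing in for a real generator and text encoder (assumption).
def fake_generate(query):
    return "water the plants"


def fake_embed(text):
    return [0.0, 1.0]


vocab = {"cut a tomato": [1.0, 0.0]}

print(narrate_with_update([0.9, 0.1], vocab, fake_generate, fake_embed))
print(narrate_with_update([0.1, 0.9], vocab, fake_generate, fake_embed))
print(narrate_with_update([0.1, 0.9], vocab, fake_generate, fake_embed))
```

After the second call the new narration is part of the vocabulary, so the third call retrieves it directly instead of generating again.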
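Results
Experimental Setup
- Experiments are run on datasets including VidCap-Eval, EgoSchema, COIN, and HiREST.
Key Findings
- VLog performs well at generating concise, contextually accurate narrations, with significantly faster inference.
- The generative retrieval model outperforms purely generative and purely retrieval models on complex reasoning tasks.
- The hierarchical indexing strategy significantly improves vocabulary retrieval efficiency.
Limitations
- VLog's vocabulary coverage is limited by its predefined vocabulary; future work will explore extending it to more domains.
Conclusion
- By introducing a generative retrieval model, hierarchical vocabulary construction, and a vocabulary update strategy, VLog provides an efficient and accurate video understanding framework.
- Experimental results show that VLog has a clear advantage in generating concise, contextually accurate narrations, making it suitable for real-time video processing.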