Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
March 31, 2025
Authors: Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
cs.AI
Abstract
We address the task of video chaptering, i.e., partitioning a long video
timeline into semantic units and generating corresponding chapter titles. While
relatively underexplored, automatic chaptering has the potential to enable
efficient navigation and content retrieval in long-form videos. In this paper,
we achieve strong chaptering performance on hour-long videos by efficiently
addressing the problem in the text domain with our 'Chapter-Llama' framework.
Specifically, we leverage a pretrained large language model (LLM) with a large
context window, and feed it as input (i) speech transcripts and (ii) captions
describing video frames, along with their respective timestamps. Given the
inefficiency of exhaustively captioning all frames, we propose a lightweight
speech-guided frame selection strategy based on speech transcript content, and
experimentally demonstrate its remarkable advantages. We train the LLM to output
timestamps for the chapter boundaries, as well as free-form chapter titles.
This simple yet powerful approach scales to processing hour-long videos in
a single forward pass. Our results demonstrate substantial improvements (e.g.,
45.3 vs 26.7 F1 score) over the state of the art on the recent VidChapters-7M
benchmark. To promote further research, we release our code and models at our
project page.
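To make the text-domain formulation concrete, below is a minimal sketch (in Python, not the authors' released code) of how the LLM input described in the abstract might be assembled: speech transcript segments and frame captions are interleaved by timestamp and serialized into a single prompt. The function names, line format, and instruction wording are illustrative assumptions.

```python
# Minimal sketch of assembling the text-only LLM input: ASR segments and
# frame captions, each with timestamps, merged into one prompt string.
# All names and the exact serialization format are hypothetical.

def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, a natural granularity for chapter boundaries."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def build_prompt(asr_segments, frame_captions):
    """asr_segments: list of (start_sec, text) from speech transcription.
    frame_captions: list of (time_sec, caption) for the selected frames."""
    events = [(t, "Speech", text) for t, text in asr_segments]
    events += [(t, "Caption", text) for t, text in frame_captions]
    events.sort(key=lambda e: e[0])  # interleave both modalities by time

    lines = [f"{format_timestamp(t)} [{kind}] {text}" for t, kind, text in events]
    instruction = (
        "Segment the video into chapters. For each chapter, output its "
        "start timestamp (HH:MM:SS) followed by a short free-form title."
    )
    return instruction + "\n\n" + "\n".join(lines)
```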
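The abstract does not spell out the speech-guided selection rule beyond saying it is based on transcript content. One plausible heuristic, sketched below purely as an assumption, is to caption frames only at the longest pauses between transcript segments, where topic changes are more likely, instead of captioning every frame.

```python
# Hedged sketch of speech-guided frame selection: an assumed heuristic,
# not the paper's exact strategy. Frames are captioned only at the
# midpoints of the largest silences between ASR segments, up to a budget.

def select_frame_times(asr_segments, budget=10):
    """asr_segments: list of (start_sec, end_sec, text); returns sorted timestamps."""
    segs = sorted(asr_segments, key=lambda seg: seg[0])
    gaps = []
    for (_, end0, _), (start1, _, _) in zip(segs, segs[1:]):
        gap = start1 - end0
        if gap > 0:
            gaps.append((gap, end0 + gap / 2.0))  # (silence length, midpoint)
    gaps.sort(reverse=True)                        # longest pauses first
    return sorted(t for _, t in gaps[:budget])
```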
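Finally, since the LLM is trained to emit chapter boundary timestamps followed by free-form titles, a small parser can turn the generated text back into structured chapters. The sketch below assumes the HH:MM:SS-prefixed line format used in the prompt sketch above; the released models may use a different serialization.

```python
# Sketch of parsing the LLM's generated chaptering into (start_seconds, title)
# pairs, assuming one "HH:MM:SS <title>" line per chapter (an assumed format).
import re

CHAPTER_LINE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s+(.*\S)\s*$")

def parse_chapters(llm_output: str):
    chapters = []
    for line in llm_output.splitlines():
        m = CHAPTER_LINE.match(line.strip())
        if m:
            h, mnt, s, title = m.groups()
            chapters.append((int(h) * 3600 + int(mnt) * 60 + int(s), title))
    return chapters
```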