Temporal Preference Optimization for Long-Form Video Understanding
January 23, 2025
Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
cs.AI
Abstract
Despite significant advancements in video large multimodal models
(video-LMMs), achieving effective temporal grounding in long-form videos
remains a challenge for existing models. To address this limitation, we propose
Temporal Preference Optimization (TPO), a novel post-training framework
designed to enhance the temporal grounding capabilities of video-LMMs through
preference learning. TPO adopts a self-training approach that enables models to
differentiate between well-grounded and less accurate temporal responses by
leveraging curated preference datasets at two granularities: localized temporal
grounding, which focuses on specific video segments, and comprehensive temporal
grounding, which captures extended temporal dependencies across entire video
sequences. By optimizing on these preference datasets, TPO significantly
enhances temporal understanding while reducing reliance on manually annotated
data. Extensive experiments on three long-form video understanding
benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness
of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO
establishes itself as the leading 7B model on the Video-MME benchmark,
underscoring the potential of TPO as a scalable and efficient solution for
advancing temporal reasoning in long-form video understanding. Project page:
https://ruili33.github.io/tpo_website
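
The abstract describes optimizing on preference pairs but does not state the training objective. As a rough illustration only, the sketch below assumes a DPO-style pairwise loss over a preferred (well-grounded) and a dis-preferred (less accurate) response. The names `PreferencePair`, `dpo_style_loss`, the `granularity` labels, and the `beta` value are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """Hypothetical container for one curated preference example."""
    question: str
    chosen: str        # response judged well-grounded in time
    rejected: str      # response judged less accurate
    granularity: str   # "localized" or "comprehensive" (labels assumed here)


def dpo_style_loss(policy_logp_chosen: float,
                   policy_logp_rejected: float,
                   ref_logp_chosen: float,
                   ref_logp_rejected: float,
                   beta: float = 0.1) -> float:
    """Assumed DPO-style objective: -log(sigmoid(beta * margin)).

    Each argument is the summed log-probability of a full response under
    the policy being trained or under a frozen reference model.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


pair = PreferencePair(
    question="What happens right after the door opens?",
    chosen="A person walks in carrying a box.",
    rejected="The scene ends with a sunset.",
    granularity="localized",
)
# Illustrative log-probabilities; real values would come from the models.
loss = dpo_style_loss(-1.0, -3.0, -2.0, -2.0)
```

Under this assumed objective, the loss falls below log 2 once the policy favors the chosen response more strongly than the reference model does, and rises above log 2 in the opposite case.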