OVO-Bench: How Far Are Your Video-LLMs from Real-World Online Video Understanding?
January 9, 2025
Authors: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
cs.AI
Abstract
Temporal Awareness, the ability to reason dynamically based on the timestamp
when a question is raised, is the key distinction between offline and online
video LLMs. Unlike offline models, which rely on complete videos for static,
post hoc analysis, online models process video streams incrementally and
dynamically adapt their responses based on the timestamp at which the question
is posed. Despite its significance, temporal awareness has not been adequately
evaluated in existing benchmarks. To fill this gap, we present OVO-Bench
(Online-VideO-Benchmark), a novel video benchmark that emphasizes the
importance of timestamps for benchmarking advanced online video
understanding. OVO-Bench evaluates the ability of video LLMs to reason
about and respond to events occurring at specific timestamps under three distinct
scenarios: (1) Backward tracing: trace back to past events to answer the
question. (2) Real-time understanding: understand and respond to events as they
unfold at the current timestamp. (3) Forward active responding: delay the
response until sufficient future information becomes available to answer the
question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos
and approximately 2,800 human-curated, fine-grained meta-annotations with
precise timestamps. We combine automated generation pipelines with human
curation. With these high-quality samples, we further develop an evaluation
pipeline that systematically queries video LLMs along the video timeline.
Evaluations of nine video LLMs reveal that, despite advancements on traditional
benchmarks, current models struggle with online video understanding, showing a
significant gap compared to human agents. We hope OVO-Bench will drive progress
in video LLMs and inspire future research in online video reasoning. Our
benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
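
To make the timestamp-conditioned evaluation protocol described above concrete, here is a minimal sketch in Python. All names (Query, load_frames_up_to, model.answer) are hypothetical illustrations, not the actual OVO-Bench API; the real pipeline is in the repository linked above. The sketch assumes OpenCV for frame decoding and a simple exact-match scorer.

```python
# Minimal sketch of timestamp-conditioned ("online") evaluation.
# Hypothetical names throughout: Query, load_frames_up_to, and model.answer
# are illustrations, not the actual OVO-Bench API.
from dataclasses import dataclass

import cv2  # assumes OpenCV is available for frame decoding


@dataclass
class Query:
    video_path: str
    timestamp: float  # time (in seconds) at which the question is raised
    question: str
    answer: str       # ground-truth answer, valid as of `timestamp`


def load_frames_up_to(path: str, t: float, sample_fps: float = 1.0):
    """Decode frames from the start of the video up to time t, sampled at
    sample_fps. The model never sees frames past the query timestamp."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps / sample_fps), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok or idx / fps > t:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames


def evaluate_online(model, queries):
    """Query the model along each video's timeline. Truncating the stream at
    the question's timestamp is what separates online evaluation from
    offline, whole-video evaluation."""
    correct = 0
    for q in queries:
        frames = load_frames_up_to(q.video_path, q.timestamp)
        prediction = model.answer(frames, q.question)  # hypothetical interface
        correct += int(prediction.strip().lower() == q.answer.strip().lower())
    return correct / len(queries)
```

Backward-tracing and real-time tasks fit this pattern directly. Forward active responding additionally requires the model to decide when enough future information has arrived before committing to an answer; under this sketch's assumptions, one natural extension is to re-query the model at successive timestamps until it chooses to respond.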