
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

January 9, 2025
作者: Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang
cs.AI

Abstract

Temporal Awareness, the ability to reason dynamically based on the timestamp at which a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses to the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for benchmarking advanced online video understanding capability. OVO-Bench evaluates the ability of video LLMs to reason about and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further develop an evaluation pipeline that systematically queries video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
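The core idea of querying a model "along the video timeline" can be sketched as follows. This is a minimal illustration, not OVO-Bench's actual pipeline (which is in the repository above): the `answer_fn` model interface and the per-second frame list are hypothetical stand-ins used only to show how each question is answered from the frames visible up to its timestamp.

```python
from typing import Callable, List, Tuple

def query_along_timeline(
    frames: List[str],                      # one frame per second, for simplicity
    queries: List[Tuple[int, str]],         # (timestamp_in_seconds, question) pairs
    answer_fn: Callable[[List[str], str], str],
) -> List[str]:
    """Answer each question using only the frames visible up to its timestamp,
    emulating an online (streaming) setting rather than offline analysis."""
    answers = []
    for t, question in sorted(queries):
        visible = frames[: t + 1]           # the model never sees future frames
        answers.append(answer_fn(visible, question))
    return answers

# Toy usage with a dummy "model" that reports how much video it has seen.
frames = [f"frame_{i}" for i in range(10)]
dummy = lambda vis, q: f"{q} -> saw {len(vis)} frames"
print(query_along_timeline(frames, [(3, "Q1"), (7, "Q2")], dummy))
```

In this setup, backward tracing corresponds to answering from the visible prefix, while forward active responding would require the harness to re-pose the question at later timestamps until the model has enough evidence to commit to an answer.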

