VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
April 23, 2025
Authors: Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
cs.AI
Abstract
Assessing the video comprehension capabilities of multimodal AI systems can
effectively measure their understanding and reasoning abilities. Most video
evaluation benchmarks are limited to a single language, typically English, and
predominantly feature videos rooted in Western cultural contexts. In this
paper, we present VideoVista-CulturalLingo, the first video evaluation
benchmark designed to bridge cultural, linguistic, and domain divides in video
comprehension. Our work differs from existing benchmarks in the following ways:
1) Cultural diversity, incorporating cultures from China, North America, and
Europe; 2) Multilingualism, with questions presented in Chinese and
English, two of the most widely spoken languages; and 3) Broad domain coverage, featuring
videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo
contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent
open-source or proprietary large video models. From the experimental results, we
observe that: 1) Existing models perform worse on Chinese-centric questions
than Western-centric ones, particularly those related to Chinese history; 2)
Current open-source models still exhibit limitations in temporal understanding,
especially in the Event Localization task, achieving a maximum score of only
45.2%; 3) Mainstream models perform strongly on general
scientific questions, while open-source models perform weakly in
mathematics.