GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
February 19, 2025
Authors: Florian Schneider, Carolin Holtermann, Chris Biemann, Anne Lauscher
cs.AI
Abstract
Large Vision-Language Models (LVLMs) have recently gained attention due to
their distinctive performance and broad applicability. While it has been
previously shown that their efficacy in usage scenarios involving non-Western
contexts falls short, existing studies are limited in scope, covering just a
narrow range of cultures, focusing exclusively on a small number of cultural
aspects, or evaluating a limited selection of models on a single task only.
Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive
multimodal benchmark designed to assess a broad spectrum of cultural knowledge
across 144 countries representing six global macro-regions. GIMMICK comprises
six tasks built upon three new datasets that span 728 unique cultural events or
facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary
and 26 open-weight models of all sizes. We systematically examine (1) regional
cultural biases, (2) the influence of model size, (3) input modalities, and (4)
external cues. Our analyses reveal strong biases toward Western cultures across
models and tasks and highlight strong correlations between model size and
performance, as well as the effectiveness of multimodal input and external
geographic cues. We further find that models have more knowledge of tangible
than intangible aspects (e.g., food vs. rituals) and that they excel in
recognizing broad cultural origins but struggle with a more nuanced
understanding.