GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking

Abstract

Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
