JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding Benchmark for Culture-aware Evaluation

Abstract

Accelerating research on Large Multimodal Models (LMMs) in non-English languages is crucial for enhancing user experiences across broader populations. In this paper, we introduce JMMMU (Japanese MMMU), the first large-scale Japanese benchmark designed to evaluate LMMs on expert-level tasks grounded in the Japanese cultural context. To facilitate comprehensive culture-aware evaluation, JMMMU features two complementary subsets: (i) a culture-agnostic (CA) subset, in which culture-independent subjects (e.g., Math) are selected and translated into Japanese, enabling one-to-one comparison with its English counterpart MMMU; and (ii) a culture-specific (CS) subset, comprising newly crafted subjects that reflect the Japanese cultural context. Using the CA subset, we observe a performance drop in many LMMs when they are evaluated in Japanese, a drop that is purely attributable to language variation. Using the CS subset, we reveal their inadequate understanding of Japanese culture. Further, by combining both subsets, we identify that some LMMs perform well on the CA subset but not on the CS subset, exposing a shallow understanding of the Japanese language that lacks depth in cultural understanding. We hope this work will not only help advance LMM performance in Japanese but also serve as a guideline for creating high-standard, culturally diverse benchmarks for multilingual LMM development. The project page is https://mmmu-japanese-benchmark.github.io/JMMMU/.

