

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

October 16, 2024
Authors: Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
cs.AI

Abstract

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
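
The released VQA data and knowledge base lend themselves to standard dataset tooling. Below is a minimal sketch of how one might load and inspect a single VQA instance, assuming the data is distributed as a Hugging Face dataset; the repository id, split name, and field names are hypothetical placeholders for illustration, not the official release identifiers.

```python
# Minimal sketch of loading and inspecting one WorldCuisines VQA instance.
# The dataset id, split, and column names below are assumptions for
# illustration; consult the official release for the actual identifiers.
from datasets import load_dataset

# Hypothetical repository id; "test_small" stands in for the 12k evaluation set.
dataset = load_dataset("worldcuisines/vqa", split="test_small")

example = dataset[0]
# A VQA instance is expected to pair an image with a question and a gold
# answer, plus language and task metadata (dish-name vs. origin prediction).
print(example.get("lang"))      # e.g., language or dialect code of the question
print(example.get("question"))  # question text in the target language
print(example.get("answer"))    # gold answer (dish name or region of origin)
image = example.get("image")    # dish image, if images are embedded in the split
```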
