AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
December 3, 2024
Authors: Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, Xiangyu Yue
cs.AI
Abstract
Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini
1.5 Pro, and Reka Core, have expanded their capabilities to include vision and
audio modalities. While these models demonstrate impressive performance across
a wide range of audio-visual applications, our proposed DeafTest reveals that
MLLMs often struggle with simple tasks humans find trivial: 1) determining
which of two sounds is louder, and 2) determining which of two sounds has a
higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a
comprehensive audio-visual benchmark designed to assess whether these MLLMs can
truly understand audio-visual information. This benchmark encompasses 4,555
carefully crafted problems, each incorporating text, visual, and audio
components. To successfully infer answers, models must effectively leverage
clues from both visual and audio inputs. To ensure precise and objective
evaluation of MLLM responses, we have structured the questions as
multiple-choice, eliminating the need for human evaluation or LLM-assisted
assessment. We benchmark a series of closed-source and open-source models and
summarize the observations. By revealing the limitations of current models, we
aim to provide useful insight for future dataset collection and model
development.
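
The multiple-choice design described above means a response can be scored by exact match on the chosen option letter, with no human or LLM judge involved. The following is a minimal sketch of such scoring, not the authors' evaluation code: the function names `extract_choice` and `accuracy`, the four-option `ABCD` format, and the letter-extraction rule are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, options: str = "ABCD") -> Optional[str]:
    """Return the first standalone option letter in a model response, or None.

    Hypothetical parsing rule: the response is upper-cased and the first
    option letter that appears as its own token is taken as the answer.
    """
    match = re.search(rf"\b([{options}])\b", response.strip().upper())
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the ground truth."""
    if not answers or len(responses) != len(answers):
        raise ValueError("responses and answers must be non-empty and aligned")
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


if __name__ == "__main__":
    model_outputs = ["B", "The answer is C", "not sure"]
    ground_truth = ["B", "D", "A"]
    print(f"accuracy = {accuracy(model_outputs, ground_truth):.3f}")
```

This simple rule would misread a free-form response that begins with the article "A", so a real harness would likely constrain the output format in the prompt or apply a stricter pattern.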