MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
October 24, 2024
Authors: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
cs.AI
Abstract
The ability to comprehend audio--which includes speech, non-speech sounds,
and music--is crucial for AI agents to interact effectively with the world. We
present MMAU, a novel benchmark designed to evaluate multimodal audio
understanding models on tasks requiring expert-level knowledge and complex
reasoning. MMAU comprises 10k carefully curated audio clips paired with
human-annotated natural language questions and answers spanning speech,
environmental sounds, and music. It includes information extraction and
reasoning questions, requiring models to demonstrate 27 distinct skills across
unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes
advanced perception and reasoning with domain-specific knowledge, challenging
models to tackle tasks akin to those faced by experts. We assess 18 open-source
and proprietary (Large) Audio-Language Models, demonstrating the significant
challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5
achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio
achieves only 52.50%, highlighting considerable room for improvement. We
believe MMAU will drive the audio and multimodal research community to develop
more advanced audio understanding models capable of solving complex audio
tasks.