
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

October 24, 2024
作者: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
cs.AI

Abstract

The ability to comprehend audio, which includes speech, non-speech sounds, and music, is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
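To make the evaluation setup concrete, below is a minimal sketch of what an MMAU-style multiple-choice evaluation loop might look like. The field names (`audio_path`, `question`, `choices`, `answer`) and the `model.predict` interface are illustrative assumptions, not the dataset's actual schema or any model's real API; the accuracy figures reported above come from this kind of exact-match scoring over the benchmark's question-answer pairs.

```python
from dataclasses import dataclass

@dataclass
class MMAUItem:
    """One MMAU-style example: an audio clip paired with a human-annotated
    multiple-choice question. Field names are hypothetical, for illustration."""
    audio_path: str    # path to the clip (speech, environmental sound, or music)
    question: str      # natural-language question about the clip
    choices: list[str] # candidate answers
    answer: str        # ground-truth choice

def evaluate(model, items: list[MMAUItem]) -> float:
    """Compute exact-match accuracy for a model that maps
    (audio, question, choices) to one of the choices.
    `model.predict` is a placeholder interface, not a real library call."""
    correct = sum(
        model.predict(item.audio_path, item.question, item.choices) == item.answer
        for item in items
    )
    return correct / len(items)
```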
