Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
April 23, 2025
作者: Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
cs.AI
Abstract
Multimodal language analysis is a rapidly evolving field that leverages
multiple modalities to enhance the understanding of high-level semantics
underlying human conversational utterances. Despite its significance, little
research has investigated the capability of multimodal large language models
(MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce
MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA
comprises over 61K multimodal utterances drawn from both staged and real-world
scenarios, covering six core dimensions of multimodal semantics: intent,
emotion, dialogue act, sentiment, speaking style, and communication behavior.
We evaluate eight mainstream branches of LLMs and MLLMs using three methods:
zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive
experiments reveal that even fine-tuned models achieve only about 60–70%
accuracy, underscoring the limitations of current MLLMs in understanding
complex human language. We believe that MMLA will serve as a solid foundation
for exploring the potential of large language models in multimodal language
analysis and provide valuable resources to advance this field. The datasets and
code are open-sourced at https://github.com/thuiar/MMLA.
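To make the zero-shot inference setting concrete, the sketch below shows how a benchmark like MMLA might be evaluated: format each utterance into a classification prompt for one semantic dimension and score the model's label against the gold label. The label sets, prompt wording, and helper names are illustrative assumptions, not taken from the MMLA repository.

```python
# Hedged sketch of a zero-shot evaluation loop over MMLA-style utterances.
# Label sets and prompt wording below are invented for illustration.
from typing import Callable

# Two of the six semantic dimensions named in the abstract; example labels
# are assumptions, not the benchmark's actual label sets.
DIMENSIONS = {
    "intent": ["inform", "ask", "complain"],
    "emotion": ["happy", "sad", "angry", "neutral"],
}

def build_prompt(dimension: str, utterance: str) -> str:
    """Compose a zero-shot classification prompt for one dimension."""
    labels = ", ".join(DIMENSIONS[dimension])
    return (
        f"Classify the {dimension} of the following utterance.\n"
        f"Choose one of: {labels}.\n"
        f"Utterance: {utterance}\nAnswer:"
    )

def evaluate(model: Callable[[str], str],
             samples: list[tuple[str, str]],
             dimension: str) -> float:
    """Accuracy of `model` (prompt -> label string) on (utterance, gold) pairs."""
    correct = sum(
        model(build_prompt(dimension, utt)).strip().lower() == gold
        for utt, gold in samples
    )
    return correct / len(samples)

# Toy stand-in for an MLLM call: always predicts "neutral".
baseline = lambda prompt: "neutral"
acc = evaluate(
    baseline,
    [("I'm fine.", "neutral"), ("This is awful!", "angry")],
    "emotion",
)
print(acc)  # 0.5
```

In practice `model` would wrap an actual LLM/MLLM call (with video or audio context for the multimodal branches), and the same loop applies unchanged to the fine-tuned and instruction-tuned settings.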