

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

October 16, 2024
Authors: Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing
cs.AI

Abstract

Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of their underlying issues. Our findings highlight key vulnerabilities, including imbalances in modality integration and biases from training data, underscoring the need for balanced cross-modal learning and enhanced hallucination mitigation strategies. Based on our observations and findings, we suggest potential research directions that could enhance the reliability of LMMs.
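The abstract does not spell out CMM's evaluation protocol. As a rough, assumption-laden sketch only, benchmarks of this kind commonly probe a model with yes/no existence questions about objects or events that are either genuinely present or deliberately absent in the multimodal input, then score answers along two complementary axes. All names below (`Probe`, `score`, the metric keys) are hypothetical illustrations, not taken from the paper.

```python
# Illustrative only: a minimal scorer for yes/no existence probes, the kind of
# query a hallucination benchmark such as CMM could use. Names are hypothetical.
from dataclasses import dataclass


@dataclass
class Probe:
    question: str        # e.g. "Is there a dog barking in the audio?"
    target_exists: bool  # ground truth: the probed object/event is (not) present
    model_answer: str    # raw model output, expected to contain "yes" or "no"


def says_yes(p: Probe) -> bool:
    """Crude answer parsing: treat any 'yes' in the output as an affirmation."""
    return "yes" in p.model_answer.lower()


def score(probes: list[Probe]) -> dict[str, float]:
    """Compute two complementary rates:
    - perception accuracy: 'yes' answers on probes whose target really exists
    - hallucination resistance: 'no' answers on probes whose target is absent
    """
    exist = [p for p in probes if p.target_exists]
    absent = [p for p in probes if not p.target_exists]
    pa = sum(says_yes(p) for p in exist) / max(len(exist), 1)
    hr = sum(not says_yes(p) for p in absent) / max(len(absent), 1)
    return {"perception_accuracy": pa, "hallucination_resistance": hr}


if __name__ == "__main__":
    demo = [
        Probe("Is a dog visible in the video?", True, "Yes, a dog appears."),
        Probe("Is a siren audible in the clip?", False, "Yes, there is a siren."),  # hallucination
    ]
    print(score(demo))
```

Separating the two rates keeps missed perceptions (low accuracy on present targets) distinct from over-affirmation on absent targets, which is the failure mode the paper attributes to unimodal priors and spurious inter-modality correlations.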
