Can MLLMs Understand the Deep Implication Behind Chinese Images?

October 17, 2024
作者: Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
cs.AI

Abstract

As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs on higher-order perception and understanding of Chinese visual content. To fill this gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. First, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with the corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect a model's understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. First, we observe a substantial gap between the performance of MLLMs and humans on CII-Bench: the highest accuracy achieved by MLLMs is 64.4%, whereas human accuracy averages 78.2% and peaks at an impressive 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of a deep knowledge base of Chinese traditional culture. Finally, we observe that most models exhibit enhanced accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io/.
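
As a rough illustration of the emotion-hint prompting mentioned in the abstract, the sketch below assembles a multiple-choice prompt for an image-implication question, optionally prepending an emotion hint. The prompt wording, option format, and the commented-out `query_mllm` call are illustrative assumptions, not the paper's actual evaluation template.

```python
# Minimal sketch of prompting an MLLM on a CII-Bench-style question,
# optionally with an image emotion hint. The template and the hypothetical
# query_mllm() API are assumptions for illustration only.

def build_prompt(question, options, emotion_hint=""):
    """Assemble a multiple-choice prompt; prepend an emotion hint if given."""
    lines = []
    if emotion_hint:
        # Hypothetical hint, e.g. "The image conveys a satirical, negative emotion."
        lines.append("Emotion hint: " + emotion_hint)
    lines.append(question)
    for label, option in zip("ABCDEF", options):
        lines.append(f"{label}. {option}")
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_prompt(
        question="What deeper implication does this image convey?",
        options=[
            "A celebration of traditional festivals",
            "A satire of excessive work culture",
            "A literal depiction of daily commuting",
        ],
        emotion_hint="The image conveys a satirical, negative emotion.",
    )
    print(prompt)
    # response = query_mllm(image_path="example.jpg", prompt=prompt)  # hypothetical API
```

Comparing model accuracy with and without the `emotion_hint` argument is one simple way to reproduce the kind of ablation the abstract reports, where most models improve when the hint is supplied.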
