Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

December 20, 2024
Authors: Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon
cs.AI

Abstract

Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics, and that existing approaches to improving MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
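To make the two contributions concrete, below is a minimal, hypothetical sketch of how an LLM-MLLM caption-correction loop and dual factuality/coverage scores could fit together. The abstract does not specify the implementation, so the function names (`audit_caption`, `decompose`, `verify`, `covered`), the claim-level decomposition, and the drop-unsupported-claims correction strategy are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch (not the paper's code): an LLM agent splits a caption into
# atomic claims, an MLLM agent verifies each claim against the image, unsupported
# claims are dropped as a correction step, and the result yields a factuality score
# over the caption's claims and a coverage score over annotated image details.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CaptionAudit:
    corrected_claims: List[str]  # claims the MLLM confirmed against the image
    factuality: float            # supported claims / all caption claims
    coverage: float              # reference details mentioned / all reference details


def audit_caption(
    caption: str,
    reference_units: List[str],                 # annotated details the image is known to contain
    decompose: Callable[[str], List[str]],      # LLM agent: caption -> atomic claims
    verify: Callable[[str], bool],              # MLLM agent: is this claim supported by the image?
    covered: Callable[[str, List[str]], bool],  # is a reference detail mentioned by any kept claim?
) -> CaptionAudit:
    claims = decompose(caption)
    kept = [c for c in claims if verify(c)]     # correction step: drop unsupported (hallucinated) claims
    factuality = len(kept) / max(len(claims), 1)
    coverage = sum(covered(r, kept) for r in reference_units) / max(len(reference_units), 1)
    return CaptionAudit(corrected_claims=kept, factuality=factuality, coverage=coverage)
```

In this framing, factuality behaves like a precision score over the caption's own claims, while coverage behaves like a recall score over annotated image details, which is why a caption can be highly factual yet low in coverage, or exhaustive yet riddled with hallucinations.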
