Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

December 20, 2024
Authors: Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon
cs.AI

Abstract

Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics, and that existing approaches for improving MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
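
To make the ideas concrete, below is a minimal Python sketch of the kind of decompose-verify-rewrite loop and dual precision/recall-style scoring the abstract describes. Every function name, prompt strategy, and metric formula here is an illustrative assumption, not the paper's actual agents or definitions.

```python
# Illustrative sketch only: all names are placeholders, not the authors' code.
from typing import List


def llm_extract_claims(caption: str) -> List[str]:
    """Placeholder for an LLM call that decomposes a caption into atomic
    claims. Naive sentence splitting stands in for a real decomposition
    prompt."""
    return [s.strip() for s in caption.split(".") if s.strip()]


def mllm_claim_is_supported(image_path: str, claim: str) -> bool:
    """Placeholder for an MLLM call that verifies one short claim against
    the image. Checking claims one at a time keeps each query short, so the
    model attends to the image rather than to a long run of generated
    text."""
    raise NotImplementedError("Back this with an actual MLLM call.")


def llm_rewrite_caption(caption: str, unsupported: List[str]) -> str:
    """Placeholder for an LLM call that rewrites the caption with the
    unsupported claims removed or corrected."""
    raise NotImplementedError("Back this with an actual LLM call.")


def correct_caption(image_path: str, caption: str) -> str:
    """The overall decompose-verify-rewrite correction loop."""
    claims = llm_extract_claims(caption)
    unsupported = [c for c in claims
                   if not mllm_claim_is_supported(image_path, c)]
    return llm_rewrite_caption(caption, unsupported) if unsupported else caption


def factuality(num_supported: int, num_claims: int) -> float:
    """Precision-style score: fraction of caption claims the image supports."""
    return num_supported / max(num_claims, 1)


def coverage(num_mentioned: int, num_reference_elements: int) -> float:
    """Recall-style score: fraction of annotated image elements the caption
    mentions."""
    return num_mentioned / max(num_reference_elements, 1)
```

The intuition behind reporting the two scores as a pair mirrors precision and recall: a corrector could trivially maximize factuality by deleting most of the caption, so coverage must be measured alongside it to penalize uninformative outputs.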
