TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
October 23, 2024
Authors: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
cs.AI
Abstract
Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure models' capabilities and underestimate their performance. Moreover, different models have different preferences for different prompts, so using the same prompt for all models causes evaluation bias. This paper analyzes this deficiency in existing benchmarks and further introduces a new evaluation framework named TP-Eval, which uses a prompt customization method to reduce evaluation bias and tap models' potential. TP-Eval rewrites the original prompts into different customized prompts for different models. In particular, we propose well-designed modules for prompt customization tailored to the MLLM evaluation scenario. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should help the community develop more comprehensive and convincing MLLM evaluation benchmarks.
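The abstract describes rewriting a benchmark's original prompt into a customized prompt for each model before scoring. Below is a minimal sketch of that idea, not the TP-Eval implementation: it assumes hypothetical helpers `candidate_rewrites` (generate prompt variants, e.g. with an LLM rewriter) and `evaluate_accuracy` (score a model with a given prompt on a small development split), and simply keeps the variant on which the model scores best.

```python
# Minimal per-model prompt-customization sketch (illustrative only, not the authors' code).
# Assumptions: `candidate_rewrites` and `evaluate_accuracy` are hypothetical helpers that a
# real system would back with an LLM-based rewriter and a benchmark dev split, respectively.

from typing import Callable, Dict, List

Model = Callable[[str, dict], str]  # model(prompt, item) -> answer string

def customize_prompt(
    original_prompt: str,
    model: Model,
    dev_items: List[Dict],
    candidate_rewrites: Callable[[str], List[str]],
    evaluate_accuracy: Callable[[Model, str, List[Dict]], float],
) -> str:
    """Return the prompt variant on which this particular model scores highest."""
    best_prompt = original_prompt
    best_score = evaluate_accuracy(model, original_prompt, dev_items)
    for variant in candidate_rewrites(original_prompt):
        score = evaluate_accuracy(model, variant, dev_items)
        if score > best_score:  # keep the rewrite only if it helps this model
            best_prompt, best_score = variant, score
    return best_prompt
```

Under this reading, each model is then evaluated on the full benchmark with its own selected prompt, which is what reduces the bias of forcing one shared prompt on all models.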