
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

October 23, 2024
作者: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
cs.AI

Abstract

Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure the models' capabilities and underestimate their performance. Moreover, different models have different preferences for different prompts, so using the same prompt for all models causes evaluation bias. This paper analyzes this deficiency in existing benchmarks and further introduces a new evaluation framework named TP-Eval, which uses a prompt customization method to reduce evaluation bias and tap models' potential. TP-Eval rewrites the original prompts into different customized prompts for different models. In particular, we propose several well-designed prompt customization modules tailored to the MLLM evaluation scenario. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
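The abstract does not detail the customization procedure, but the core idea of per-model prompt customization can be illustrated with a minimal sketch: for each model, search over candidate rewrites of the original benchmark prompt, keep the rewrite that scores best on a small development split, and then evaluate the model with its own customized prompt. This is an assumption-based illustration, not the authors' implementation; `score_fn`, `customize_prompt`, and `evaluate_with_customization` are hypothetical names introduced here.

```python
# Minimal sketch of per-model prompt customization (not the TP-Eval code).
# score_fn(model, prompt, examples) is a hypothetical callable that returns
# the model's accuracy on the given examples when queried with that prompt.

from typing import Any, Callable, Dict, Sequence

ScoreFn = Callable[[Any, str, Sequence[dict]], float]

def customize_prompt(
    model: Any,
    original_prompt: str,
    candidate_rewrites: Sequence[str],
    dev_examples: Sequence[dict],
    score_fn: ScoreFn,
) -> str:
    """Return the prompt (original or rewrite) this model performs best with."""
    best_prompt = original_prompt
    best_score = score_fn(model, original_prompt, dev_examples)
    for prompt in candidate_rewrites:
        score = score_fn(model, prompt, dev_examples)
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt

def evaluate_with_customization(
    models: Dict[str, Any],
    original_prompt: str,
    candidate_rewrites: Sequence[str],
    dev_examples: Sequence[dict],
    test_examples: Sequence[dict],
    score_fn: ScoreFn,
) -> Dict[str, float]:
    """Score each model on the test split using its own customized prompt."""
    results = {}
    for name, model in models.items():
        prompt = customize_prompt(
            model, original_prompt, candidate_rewrites, dev_examples, score_fn
        )
        results[name] = score_fn(model, prompt, test_examples)
    return results
```

The point of the sketch is the evaluation protocol, not the search strategy: because each model is scored with the prompt it prefers, differences in prompt preference no longer masquerade as differences in capability.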

