TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

Abstract

Multimodal large language models (MLLMs) have recently attracted considerable attention for their impressive capabilities. Evaluating MLLMs is becoming critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Inappropriate prompts may therefore obscure a model's capabilities and lead to underestimating its performance. Moreover, different models prefer different prompts, so using the same prompt for all models can introduce evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces TP-Eval, a new evaluation framework that uses prompt customization to reduce evaluation bias and tap models' potential. TP-Eval rewrites the original prompts into customized prompts for different models. In particular, we propose several modules for prompt customization that are tailored to the MLLM evaluation scenario. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
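
The idea that a single shared prompt can underestimate some models, while a per-model prompt reveals their capability, can be illustrated with a small sketch. This is not TP-Eval's method: it is a brute-force search over a fixed list of candidate prompts, and every name in it (`accuracy`, `best_prompt_per_model`, the toy models, the prompts, and the examples) is a hypothetical stand-in. Image inputs are omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "model" is any callable mapping (prompt, question) -> answer.
Model = Callable[[str, str], str]

# Toy ground truth used by the stand-in models below.
ANSWERS = {"2+2": "4", "3+3": "6"}


def toy_model_a(prompt: str, question: str) -> str:
    # Hypothetical model that only answers correctly with a reasoning cue.
    return ANSWERS[question] if "step by step" in prompt else "unknown"


def toy_model_b(prompt: str, question: str) -> str:
    # Hypothetical model that only answers correctly without that cue.
    return ANSWERS[question] if "step by step" not in prompt else "unknown"


def accuracy(model: Model, prompt: str, examples: List[Tuple[str, str]]) -> float:
    """Fraction of (question, expected) pairs the model answers correctly."""
    correct = sum(model(prompt, q) == expected for q, expected in examples)
    return correct / len(examples)


def best_prompt_per_model(
    models: Dict[str, Model],
    candidate_prompts: List[str],
    examples: List[Tuple[str, str]],
) -> Dict[str, Tuple[str, float]]:
    """For each model, pick the candidate prompt with the highest accuracy."""
    results = {}
    for name, model in models.items():
        scored = [(accuracy(model, p, examples), p) for p in candidate_prompts]
        best_score, best = max(scored, key=lambda s: s[0])
        results[name] = (best, best_score)
    return results


examples = [("2+2", "4"), ("3+3", "6")]
prompts = [
    "Answer the question.",
    "Think step by step, then answer the question.",
]
models = {"model_a": toy_model_a, "model_b": toy_model_b}

# Accuracy of model_a under the first (shared) prompt alone.
shared_prompt_score = accuracy(toy_model_a, prompts[0], examples)

# Per-model prompt selection.
results = best_prompt_per_model(models, prompts, examples)
for name, (prompt, score) in results.items():
    print(f"{name}: best prompt = {prompt!r}, accuracy = {score:.2f}")
```

In this toy setup, evaluating both models with only the first prompt would score `model_a` at zero even though it can answer every question, which is the kind of evaluation bias the abstract describes. A real framework would generate or optimize the candidate prompts rather than enumerate a fixed list.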
