
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

March 31, 2025
Authors: Yoonshik Kim, Jaeyoon Jung
cs.AI

Abstract

The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA.
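To make the grading idea concrete, below is a minimal sketch of rubric-based judging, written under assumptions rather than taken from the released KOFFVQA code: each question carries pre-determined criteria with point values, and a judge model is asked only to check whether a response satisfies each criterion. The names `GradingCriterion`, `grade_response`, and `judge_fn` are illustrative, not part of the paper's implementation.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GradingCriterion:
    description: str  # e.g. "States that the sign in the image is written in Korean"
    points: int       # points awarded if the criterion is satisfied


def grade_response(
    question: str,
    response: str,
    criteria: List[GradingCriterion],
    judge_fn: Callable[[str], str],  # wraps any chat/completion API that returns text
) -> float:
    """Score a free-form answer against a fixed rubric; returns a 0-100 score."""
    score = 0
    max_score = sum(c.points for c in criteria)
    for criterion in criteria:
        # The judge only verifies a single objective rule per call,
        # rather than assigning a subjective overall score.
        prompt = (
            f"Question: {question}\n"
            f"Response: {response}\n"
            f"Criterion: {criterion.description}\n"
            "Does the response satisfy the criterion? Answer YES or NO."
        )
        verdict = judge_fn(prompt).strip().upper()
        if re.match(r"^YES", verdict):
            score += criterion.points
    return 100.0 * score / max_score if max_score else 0.0
```

Because each judge call reduces to a yes/no check against a fixed rule, even a small open-source judge model can apply the rubric consistently, which is the reliability argument the abstract makes.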
