DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
October 29, 2024
Authors: Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang
cs.AI
Abstract
The rapid advancements in Vision-Language Models (VLMs) have shown great
potential in tackling mathematical reasoning tasks that involve visual context.
While humans can reliably apply solution steps to similar problems with minor
modifications, we found that SOTA VLMs like GPT-4o can consistently fail in
these scenarios, revealing limitations in their mathematical reasoning
capabilities. In this paper, we investigate the robustness of mathematical
reasoning in VLMs and evaluate how well these models perform under different
variants of the same question, such as changes in visual numerical values or
function graphs. While several vision-based math benchmarks have been developed
to assess VLMs' problem-solving capabilities, these benchmarks contain only
static sets of problems and cannot easily evaluate mathematical reasoning
robustness. To fill this gap, we introduce DynaMath, a dynamic visual math
benchmark designed for in-depth assessment of VLMs. DynaMath includes 501
high-quality, multi-topic seed questions, each represented as a Python program.
These programs are carefully designed and annotated to enable the automatic
generation of a much larger set of concrete questions, including many different
types of visual and textual variations. DynaMath allows us to evaluate the
generalization ability of VLMs by assessing their performance under varying
input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010
generated concrete questions. Our results show that the worst-case model
accuracy, defined as the percentage of seed questions answered correctly in all
10 of their variants, is significantly lower than the average-case accuracy. Our
analysis emphasizes the need to study the robustness of VLMs' reasoning
abilities, and DynaMath provides valuable insights to guide the development of
more reliable models for mathematical reasoning.
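The abstract describes each seed question as a Python program that automatically generates concrete variants. As a rough, hypothetical sketch of what such a program might look like (the structure, function names, and the example question below are illustrative assumptions, not taken from the benchmark's actual code):

```python
import random
import matplotlib.pyplot as plt

def generate_variant(seed: int):
    """One concrete variant of a hypothetical seed question about a parabola."""
    rng = random.Random(seed)
    a = rng.choice([-3, -2, -1, 1, 2, 3])  # randomized leading coefficient (never 0)
    c = rng.randint(-5, 5)                 # randomized vertical shift

    # Render the visual context for this variant and save it to disk.
    xs = [x / 10 for x in range(-50, 51)]
    ys = [a * x**2 + c for x in xs]
    plt.figure()
    plt.plot(xs, ys)
    plt.title("y = f(x)")
    image_path = f"variant_{seed}.png"
    plt.savefig(image_path)
    plt.close()

    # The ground truth is computed from the sampled parameters, not hand-labeled.
    question = "Does the parabola shown in the graph open upward or downward?"
    answer = "upward" if a > 0 else "downward"
    return question, image_path, answer

# One seed program can emit many concrete questions, e.g. 10 variants:
variants = [generate_variant(s) for s in range(10)]
```

In a scheme like this, the ground-truth answer is computed alongside the image, so every generated variant comes with a correct label automatically, which is what makes large-scale robustness evaluation tractable.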
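The abstract contrasts average-case and worst-case accuracy over 10 variants per seed question. One natural way to formalize these definitions, sketched here from the abstract's wording (the paper's own notation may differ): with $n$ seed questions and $m = 10$ variants each, write $\hat{a}_{i,j}$ for the model's answer and $a_{i,j}$ for the ground truth on variant $j$ of seed question $i$,

$$A_{\text{avg}} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\mathbb{1}\left[\hat{a}_{i,j} = a_{i,j}\right], \qquad A_{\text{worst}} = \frac{1}{n}\sum_{i=1}^{n}\prod_{j=1}^{m}\mathbb{1}\left[\hat{a}_{i,j} = a_{i,j}\right].$$

Since the product equals 1 only when all $m$ variants of a seed question are answered correctly, $A_{\text{worst}} \le A_{\text{avg}}$ always holds; the reported finding is that this gap is large in practice.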