DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
October 29, 2024
Authors: Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang
cs.AI
Abstract
The rapid advancements in Vision-Language Models (VLMs) have shown great
potential in tackling mathematical reasoning tasks that involve visual context.
Unlike humans, who can reliably apply solution steps to similar problems with
minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail
in these scenarios, revealing limitations in their mathematical reasoning
capabilities. In this paper, we investigate the mathematical reasoning
robustness in VLMs and evaluate how well these models perform under different
variants of the same question, such as changes in visual numerical values or
function graphs. While several vision-based math benchmarks have been developed
to assess VLMs' problem-solving capabilities, these benchmarks contain only
static sets of problems and cannot easily evaluate mathematical reasoning
robustness. To fill this gap, we introduce DynaMath, a dynamic visual math
benchmark designed for in-depth assessment of VLMs. DynaMath includes 501
high-quality, multi-topic seed questions, each represented as a Python program.
These programs are carefully designed and annotated to enable the automatic
generation of a much larger set of concrete questions, including many different
types of visual and textual variations. DynaMath allows us to evaluate the
generalization ability of VLMs by assessing their performance under varying
input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010
generated concrete questions. Our results show that the worst-case model
accuracy, defined as the percentage of seed questions answered correctly in all
10 of their variants, is significantly lower than the average-case accuracy. Our
analysis emphasizes the need to study the robustness of VLMs' reasoning
abilities, and DynaMath provides valuable insights to guide the development of
more reliable models for mathematical reasoning.
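The abstract describes each seed question as a Python program that can regenerate its figure, question text, and ground-truth answer under randomized conditions. The abstract does not show these programs, so the following is only a minimal sketch of what such a generator might look like; the function name `generate_variant`, the use of matplotlib for rendering, and the returned fields are illustrative assumptions, not DynaMath's actual interface.

```python
# Illustrative sketch only: DynaMath's real seed-question programs and
# annotations are not shown in the abstract, so every name here is assumed.
import random

import matplotlib.pyplot as plt
import numpy as np


def generate_variant(seed: int) -> dict:
    """Produce one concrete question by randomizing the visual numerical values."""
    rng = random.Random(seed)
    a = rng.choice([1, 2, 3])   # quadratic coefficient (visual variation)
    b = rng.randint(-3, 3)      # vertex location (visual variation)

    # Render the figure for this particular variant.
    x = np.linspace(-5, 5, 200)
    fig, ax = plt.subplots()
    ax.plot(x, a * (x - b) ** 2)
    ax.set_title("y = f(x)")
    fig.savefig(f"variant_{seed}.png")
    plt.close(fig)

    return {
        "image": f"variant_{seed}.png",
        "question": "Based on the graph, at what x-coordinate does f(x) reach its minimum?",
        "answer": b,            # ground-truth answer recomputed for each variant
    }


# Ten concrete variants of the same seed question, matching the paper's setup.
variants = [generate_variant(seed) for seed in range(10)]
```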
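The abstract also contrasts worst-case accuracy (a seed question counts only if all 10 of its variants are answered correctly) with average-case accuracy (over all generated concrete questions). The sketch below shows one way these two metrics could be computed; the `results` data layout is an assumption made for illustration, not the benchmark's actual evaluation code.

```python
# results[i][j] is True if the model answered variant j of seed question i
# correctly; this layout is assumed for illustration.
from typing import List


def average_case_accuracy(results: List[List[bool]]) -> float:
    """Fraction of all concrete questions (seed x variant) answered correctly."""
    total = sum(len(row) for row in results)
    correct = sum(sum(row) for row in results)
    return correct / total


def worst_case_accuracy(results: List[List[bool]]) -> float:
    """Fraction of seed questions whose every variant was answered correctly."""
    return sum(all(row) for row in results) / len(results)


# Toy example: 3 seed questions x 10 variants each.
toy = [
    [True] * 10,            # all 10 variants correct
    [True] * 9 + [False],   # one variant missed
    [True, False] * 5,      # half the variants missed
]
print(average_case_accuracy(toy))  # 0.8
print(worst_case_accuracy(toy))    # ~0.33, much lower than the average case
```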