DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
October 29, 2024
Authors: Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang
cs.AI
Abstract
The rapid advancements in Vision-Language Models (VLMs) have shown great
potential in tackling mathematical reasoning tasks that involve visual context.
Unlike humans, who can reliably apply solution steps to similar problems with
minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail
in these scenarios, revealing limitations in their mathematical reasoning
capabilities. In this paper, we investigate the mathematical reasoning
robustness in VLMs and evaluate how well these models perform under different
variants of the same question, such as changes in visual numerical values or
function graphs. While several vision-based math benchmarks have been developed
to assess VLMs' problem-solving capabilities, these benchmarks contain only
static sets of problems and cannot easily evaluate mathematical reasoning
robustness. To fill this gap, we introduce DynaMath, a dynamic visual math
benchmark designed for in-depth assessment of VLMs. DynaMath includes 501
high-quality, multi-topic seed questions, each represented as a Python program.
These programs are carefully designed and annotated to enable the automatic
generation of a much larger set of concrete questions, including many different
types of visual and textual variations. DynaMath allows us to evaluate the
generalization ability of VLMs by assessing their performance under varying
input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010
generated concrete questions. Our results show that the worst-case model
accuracy, defined as the percentage of seed questions answered correctly in all
10 of their variants, is significantly lower than the average-case accuracy. Our
analysis emphasizes the need to study the robustness of VLMs' reasoning
abilities, and DynaMath provides valuable insights to guide the development of
more reliable models for mathematical reasoning.
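The abstract describes each seed question as a Python program that can regenerate its figure, question text, and ground-truth answer under randomized conditions. The abstract does not show these programs, so the following is only a minimal sketch of what such a generator might look like; the function name `generate_variant`, the use of matplotlib for rendering, and the returned fields are illustrative assumptions, not DynaMath's actual interface.

```python
# Illustrative sketch only: DynaMath's real seed-question programs and
# annotations are not shown in the abstract, so every name here is assumed.
import random

import matplotlib.pyplot as plt
import numpy as np


def generate_variant(seed: int) -> dict:
    """Produce one concrete question by randomizing the visual numerical values."""
    rng = random.Random(seed)
    a = rng.choice([1, 2, 3])   # quadratic coefficient (visual variation)
    b = rng.randint(-3, 3)      # vertex location (visual variation)

    # Render the figure for this particular variant.
    x = np.linspace(-5, 5, 200)
    fig, ax = plt.subplots()
    ax.plot(x, a * (x - b) ** 2)
    ax.set_title("y = f(x)")
    fig.savefig(f"variant_{seed}.png")
    plt.close(fig)

    return {
        "image": f"variant_{seed}.png",
        "question": "Based on the graph, at what x-coordinate does f(x) reach its minimum?",
        "answer": b,            # ground-truth answer recomputed for each variant
    }


# Ten concrete variants of the same seed question, matching the paper's setup.
variants = [generate_variant(seed) for seed in range(10)]
```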
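The abstract also contrasts worst-case accuracy (a seed question counts only if all 10 of its variants are answered correctly) with average-case accuracy (over all generated concrete questions). The sketch below shows one way these two metrics could be computed; the `results` data layout is an assumption made for illustration, not the benchmark's actual evaluation code.

```python
# results[i][j] is True if the model answered variant j of seed question i
# correctly; this layout is assumed for illustration.
from typing import List


def average_case_accuracy(results: List[List[bool]]) -> float:
    """Fraction of all concrete questions (seed x variant) answered correctly."""
    total = sum(len(row) for row in results)
    correct = sum(sum(row) for row in results)
    return correct / total


def worst_case_accuracy(results: List[List[bool]]) -> float:
    """Fraction of seed questions whose every variant was answered correctly."""
    return sum(all(row) for row in results) / len(results)


# Toy example: 3 seed questions x 10 variants each.
toy = [
    [True] * 10,            # all 10 variants correct
    [True] * 9 + [False],   # one variant missed
    [True, False] * 5,      # half the variants missed
]
print(average_case_accuracy(toy))  # 0.8
print(worst_case_accuracy(toy))    # ~0.33, much lower than the average case
```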