DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Abstract

The rapid advancement of Vision-Language Models (VLMs) has shown great potential for mathematical reasoning tasks that involve visual context. Unlike humans, who can reliably apply the same solution steps to similar problems with minor modifications, we found that state-of-the-art (SOTA) VLMs such as GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the robustness of mathematical reasoning in VLMs and evaluate how well these models perform on different variants of the same question, such as changes in visual numerical values or function graphs. Although several vision-based math benchmarks have been developed to assess the problem-solving capabilities of VLMs, they contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. These programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs on 5,010 generated concrete questions (10 variants per seed question). Our results show that the worst-case model accuracy, defined as the percentage of seed questions answered correctly in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.
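The gap between the two metrics named in the abstract can be illustrated with a minimal sketch. It follows the abstract's definition of worst-case accuracy (a seed question counts only if every one of its variants is answered correctly) and assumes that average-case accuracy is the fraction of all concrete questions answered correctly, which the abstract does not spell out. The function names and the toy results are hypothetical and are not taken from the DynaMath code or its reported numbers.

```python
from typing import Sequence


def average_case_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of all concrete questions answered correctly.

    Assumption: the abstract does not define this metric explicitly;
    here it is the plain mean over every generated variant.
    """
    total = sum(len(variants) for variants in results)
    correct = sum(sum(variants) for variants in results)
    return correct / total


def worst_case_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of seed questions whose variants were ALL answered correctly."""
    return sum(all(variants) for variants in results) / len(results)


# Toy data, made up for illustration: 3 seed questions with 4 variants each
# (the benchmark itself uses 10 variants per seed question).
toy_results = [
    [True, True, True, True],     # solved on every variant
    [True, False, True, True],    # one failure removes it from the worst case
    [False, False, True, False],  # mostly wrong
]

avg = average_case_accuracy(toy_results)
worst = worst_case_accuracy(toy_results)
print(f"average-case: {avg:.3f}, worst-case: {worst:.3f}")
```

Under these definitions the worst-case accuracy can never exceed the average-case accuracy, because a single wrong variant disqualifies the whole seed question while still contributing its correct variants to the average.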
