Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

Abstract

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, which fail to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates, of up to 14 percentage points, as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
