Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
February 12, 2025
Authors: Safal Shrestha, Minwu Kim, Keith Ross
cs.AI
Abstract
Mathematical reasoning in Large Language Models (LLMs) is often evaluated
using benchmarks with limited numerical ranges, failing to reflect real-world
problem-solving across diverse scales. Furthermore, most existing evaluation
methods only compare model outputs to ground-truth answers, obscuring insights
into reasoning processes. To address these limitations, we introduce
GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs
numerical values in math problems to assess model robustness across varying
numerical scales. Additionally, we propose a novel grading methodology that
distinguishes between logical and non-logical errors, offering a more precise
evaluation of reasoning processes beyond computational accuracy. Our
experiments with various models reveal a significant increase in logical error
rates (up to 14 percentage points) as numerical complexity rises, demonstrating a
general weakness in reasoning with out-of-distribution numerical values.
Moreover, while models demonstrate high accuracy on standalone arithmetic
tasks, their performance deteriorates substantially when computations are
embedded within word problems. These findings provide a comprehensive
evaluation of LLMs' mathematical reasoning capabilities and inform future
research directions for improving numerical generalization in language models.
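The two mechanisms described in the abstract can be illustrated with a minimal sketch. Both function names and all details below are hypothetical and are not the paper's implementation: `perturb_numbers` rewrites the integers in a word problem at a chosen digit scale (the real GSM-Ranges generator must also keep the problem solvable), and `classify_error` is a crude proxy for the grading methodology, treating a wrong answer as non-logical when re-evaluating the model's own final expression recovers the ground truth.

```python
import random
import re

def perturb_numbers(problem: str, digits: int, seed: int = 0) -> str:
    """Replace every integer in a word problem with a random integer of
    `digits` digits. Hypothetical sketch of GSM-Ranges-style perturbation;
    the actual generator also preserves the problem's solution template."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return re.sub(r"\d+", lambda m: str(rng.randint(lo, hi)), problem)

def classify_error(final_expression: str, expected: float) -> str:
    """For a wrong model answer, return 'non-logical' if re-evaluating the
    model's own final expression yields the ground truth (the setup was
    right and only the arithmetic slipped), else 'logical' (the reasoning
    itself was wrong). A simplified proxy for the paper's grading."""
    recomputed = eval(final_expression, {"__builtins__": {}})
    return "non-logical" if recomputed == expected else "logical"

# Rescale the numbers in a GSM8K-style problem to five digits:
print(perturb_numbers("Tom has 3 apples and buys 12 more.", digits=5))
# Model wrote the correct expression but miscomputed it:
print(classify_error("3 + 12", expected=15))  # non-logical
# Model multiplied where it should have added:
print(classify_error("3 * 12", expected=15))  # logical
```

The key distinction this sketch captures is the one the abstract draws: computational accuracy (did the arithmetic come out right?) is separated from reasoning quality (were the equations the model set up correct in the first place?).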