Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
February 12, 2025
Authors: Safal Shrestha, Minwu Kim, Keith Ross
cs.AI
Abstract
Mathematical reasoning in Large Language Models (LLMs) is often evaluated
using benchmarks with limited numerical ranges, failing to reflect real-world
problem-solving across diverse scales. Furthermore, most existing evaluation
methods only compare model outputs to ground-truth answers, obscuring insights
into reasoning processes. To address these limitations, we introduce
GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs
numerical values in math problems to assess model robustness across varying
numerical scales. Additionally, we propose a novel grading methodology that
distinguishes between logical and non-logical errors, offering a more precise
evaluation of reasoning processes beyond computational accuracy. Our
experiments with various models reveal a significant increase in logical error
rates (up to 14 percentage points) as numerical complexity rises, demonstrating a
general weakness in reasoning with out-of-distribution numerical values.
Moreover, while models demonstrate high accuracy on standalone arithmetic
tasks, their performance deteriorates substantially when computations are
embedded within word problems. These findings provide a comprehensive
evaluation of LLMs' mathematical reasoning capabilities and inform future
research directions for improving numerical generalization in language models.
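The two mechanisms described in the abstract can be illustrated with a minimal sketch. Both function names and all details below are hypothetical and are not the paper's implementation: `perturb_numbers` rewrites the integers in a word problem at a chosen digit scale (the real GSM-Ranges generator must also keep the problem solvable), and `classify_error` is a crude proxy for the grading methodology, treating a wrong answer as non-logical when re-evaluating the model's own final expression recovers the ground truth.

```python
import random
import re

def perturb_numbers(problem: str, digits: int, seed: int = 0) -> str:
    """Replace every integer in a word problem with a random integer of
    `digits` digits. Hypothetical sketch of GSM-Ranges-style perturbation;
    the actual generator also preserves the problem's solution template."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return re.sub(r"\d+", lambda m: str(rng.randint(lo, hi)), problem)

def classify_error(final_expression: str, expected: float) -> str:
    """For a wrong model answer, return 'non-logical' if re-evaluating the
    model's own final expression yields the ground truth (the setup was
    right and only the arithmetic slipped), else 'logical' (the reasoning
    itself was wrong). A simplified proxy for the paper's grading."""
    recomputed = eval(final_expression, {"__builtins__": {}})
    return "non-logical" if recomputed == expected else "logical"

# Rescale the numbers in a GSM8K-style problem to five digits:
print(perturb_numbers("Tom has 3 apples and buys 12 more.", digits=5))
# Model wrote the correct expression but miscomputed it:
print(classify_error("3 + 12", expected=15))  # non-logical
# Model multiplied where it should have added:
print(classify_error("3 * 12", expected=15))  # logical
```

The key distinction this sketch captures is the one the abstract draws: computational accuracy (did the arithmetic come out right?) is separated from reasoning quality (were the equations the model set up correct in the first place?).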