Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

February 12, 2025
Authors: Safal Shrestha, Minwu Kim, Keith Ross
cs.AI

Abstract

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, failing to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates (up to 14 percentage points) as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
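
The abstract does not specify how GSM-Ranges perturbs the numbers, so the following is only a minimal sketch of the general idea it describes: swap each integer in a GSM8K-style word problem for a value drawn from a progressively larger magnitude range, then recompute the ground-truth answer for the perturbed problem. The magnitude levels, the `perturb_numbers` function, and the regex-based substitution below are illustrative assumptions, not the paper's actual generator.

```python
import random
import re

# Hypothetical magnitude levels (an assumption for illustration,
# not the ranges used in the paper).
LEVELS = {
    "original": None,                       # keep the problem's own numbers
    "hundreds": (100, 999),
    "tens_of_thousands": (10_000, 99_999),
    "millions": (1_000_000, 9_999_999),
}

def perturb_numbers(problem: str, level: str, seed: int = 0) -> str:
    """Replace every integer in `problem` with a random value from the chosen range.

    Note: the ground-truth answer must be recomputed separately for the
    perturbed problem; this sketch only rewrites the problem text.
    """
    if LEVELS[level] is None:
        return problem
    lo, hi = LEVELS[level]
    rng = random.Random(seed)
    return re.sub(r"\d+", lambda m: str(rng.randint(lo, hi)), problem)

if __name__ == "__main__":
    base = "Ali has 4 boxes with 12 apples each. How many apples does Ali have?"
    for lvl in LEVELS:
        print(f"{lvl}: {perturb_numbers(base, lvl)}")
```

Sampling the same template at several magnitude levels lets one measure, as the abstract reports, whether a model's logical error rate rises when the arithmetic moves outside the numerical range typical of the original benchmark.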
