Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
April 7, 2025
Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
cs.AI
Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent works do, we investigate this using elementary two-integer addition (0 to 2^{64}), probing two core properties: commutativity (A+B=B+A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). While state-of-the-art LLMs achieve 73.8%-99.8% accuracy on numerical addition, performance collapses to ≤7.5% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of A+B ≠ B+A) further support this conclusion. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
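To make the probing setup concrete, here is a minimal Python sketch, not the authors' code, that constructs the two kinds of probes the abstract describes: a commutativity pair (the same sum posed as A+B and as B+A) and an isomorphic symbolic mapping in which each decimal digit is replaced by a distinct letter (with 7 → y to match the paper's example). The function names and the specific digit-to-letter table are illustrative assumptions.

```python
import random

# Hypothetical digit-to-symbol table for the compositional-generalization probe:
# each decimal digit maps to a distinct letter (e.g., '7' -> 'y'), yielding an
# isomorphic number system in which the addition rule itself is unchanged.
DIGIT_TO_SYMBOL = {
    "0": "q", "1": "w", "2": "e", "3": "r", "4": "t",
    "5": "b", "6": "u", "7": "y", "8": "o", "9": "p",
}

def to_symbols(n: int) -> str:
    """Render an integer in the mapped symbol alphabet."""
    return "".join(DIGIT_TO_SYMBOL[d] for d in str(n))

def make_probes(num_pairs: int, max_value: int = 2**64, seed: int = 0):
    """Generate commutativity and symbolic-mapping probes for two-integer addition."""
    rng = random.Random(seed)
    probes = []
    for _ in range(num_pairs):
        a, b = rng.randrange(max_value), rng.randrange(max_value)
        probes.append({
            # Commutativity probe: a model that learned the rule must answer
            # both orderings identically (A+B = B+A).
            "numeric": (f"{a}+{b}=", f"{b}+{a}="),
            # Symbolic probe: the same sum in the mapped alphabet, paired with
            # the mapped rendering of the correct answer a+b.
            "symbolic": (f"{to_symbols(a)}+{to_symbols(b)}=", to_symbols(a + b)),
        })
    return probes

if __name__ == "__main__":
    for p in make_probes(2):
        print(p["numeric"], p["symbolic"])
```

Scoring under this sketch is then a matter of checking that a model's completions agree across the two orderings of each numeric probe, and that its answer to the symbolic probe matches the mapped rendering of a+b.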