Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
April 7, 2025
Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
cs.AI
Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent works do, we investigate this using elementary two-integer addition (0 to 2^{64}), probing two core properties: commutativity (A+B=B+A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). While state-of-the-art LLMs achieve 73.8%-99.8% accuracy on numerical addition, performance collapses to ≤7.5% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of A+B ≠ B+A) further support this conclusion. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
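To make the probing setup concrete, here is a minimal Python sketch, not the authors' code, that constructs the two kinds of probes the abstract describes: a commutativity pair (the same sum posed as A+B and as B+A) and an isomorphic symbolic mapping in which each decimal digit is replaced by a distinct letter (with 7 → y to match the paper's example). The function names and the specific digit-to-letter table are illustrative assumptions.

```python
import random

# Hypothetical digit-to-symbol table for the compositional-generalization probe:
# each decimal digit maps to a distinct letter (e.g., '7' -> 'y'), yielding an
# isomorphic number system in which the addition rule itself is unchanged.
DIGIT_TO_SYMBOL = {
    "0": "q", "1": "w", "2": "e", "3": "r", "4": "t",
    "5": "b", "6": "u", "7": "y", "8": "o", "9": "p",
}

def to_symbols(n: int) -> str:
    """Render an integer in the mapped symbol alphabet."""
    return "".join(DIGIT_TO_SYMBOL[d] for d in str(n))

def make_probes(num_pairs: int, max_value: int = 2**64, seed: int = 0):
    """Generate commutativity and symbolic-mapping probes for two-integer addition."""
    rng = random.Random(seed)
    probes = []
    for _ in range(num_pairs):
        a, b = rng.randrange(max_value), rng.randrange(max_value)
        probes.append({
            # Commutativity probe: a model that learned the rule must answer
            # both orderings identically (A+B = B+A).
            "numeric": (f"{a}+{b}=", f"{b}+{a}="),
            # Symbolic probe: the same sum in the mapped alphabet, paired with
            # the mapped rendering of the correct answer a+b.
            "symbolic": (f"{to_symbols(a)}+{to_symbols(b)}=", to_symbols(a + b)),
        })
    return probes

if __name__ == "__main__":
    for p in make_probes(2):
        print(p["numeric"], p["symbolic"])
```

Scoring under this sketch is then a matter of checking that a model's completions agree across the two orderings of each numeric probe, and that its answer to the symbolic probe matches the mapped rendering of a+b.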