Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
April 7, 2025
Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
cs.AI
Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles, or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent work does, we investigate this using elementary two-integer addition (0 to 2^{64}), probing two core properties: commutativity (A + B = B + A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). While state-of-the-art LLMs achieve 73.8–99.8% accuracy on numerical addition, performance collapses to ≤ 7.5% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of A + B ≠ B + A) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
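To make the probing setup concrete, below is a minimal sketch of how commutativity and symbol-mapping probes of the kind the abstract describes could be generated. The `SYMBOLS` alphabet, the `to_symbolic` helper, and `make_probe` are hypothetical names introduced for illustration; the paper's actual prompt templates and symbol alphabet are not specified here.

```python
import random

# Hypothetical digit-to-symbol mapping. The abstract's "7 → y" example
# suggests a bijective map from digits to arbitrary symbols; the exact
# alphabet used in the paper is assumed here for illustration.
SYMBOLS = dict(zip("0123456789", "qwertyuiop"))

def to_symbolic(n: int) -> str:
    """Rewrite an integer's digits under the isomorphic symbol mapping."""
    return "".join(SYMBOLS[d] for d in str(n))

def make_probe(max_bits: int = 64) -> dict:
    """Sample one addition instance plus its commuted and symbolic variants."""
    a = random.randrange(2 ** max_bits)
    b = random.randrange(2 ** max_bits)
    return {
        "numeric": f"{a} + {b} = ?",
        "commuted": f"{b} + {a} = ?",  # must yield the same answer if A+B = B+A holds
        "symbolic": f"{to_symbolic(a)} + {to_symbolic(b)} = ?",  # same rule, new surface form
        "answer": a + b,
    }

if __name__ == "__main__":
    probe = make_probe()
    for name, text in probe.items():
        print(f"{name}: {text}")
```

A model that has learned the addition rule should answer the numeric, commuted, and symbolic variants consistently; the large accuracy gap reported above (73.8–99.8% numeric vs. ≤ 7.5% symbolic) is what suggests pattern memorization instead.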