Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models
April 7, 2025
Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
cs.AI
Abstract
Despite high benchmark scores, Large Language Models (LLMs) often fail at simple problems, raising a critical question: do LLMs learn mathematical principles, or merely memorize patterns? Rather than designing increasingly complex benchmarks as recent work does, we investigate this using elementary two-integer addition (0 to 2^{64}), probing two core properties: commutativity (A + B = B + A) and compositional generalization (via isomorphic symbolic mappings, e.g., 7 → y). While state-of-the-art LLMs achieve 73.8–99.8% accuracy on numerical addition, performance collapses to ≤ 7.5% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of A + B ≠ B + A) further support this. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
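To make the probing setup concrete, below is a minimal sketch of how commutativity and symbol-mapping probes of the kind the abstract describes could be generated. The `SYMBOLS` alphabet, the `to_symbolic` helper, and `make_probe` are hypothetical names introduced for illustration; the paper's actual prompt templates and symbol alphabet are not specified here.

```python
import random

# Hypothetical digit-to-symbol mapping. The abstract's "7 → y" example
# suggests a bijective map from digits to arbitrary symbols; the exact
# alphabet used in the paper is assumed here for illustration.
SYMBOLS = dict(zip("0123456789", "qwertyuiop"))

def to_symbolic(n: int) -> str:
    """Rewrite an integer's digits under the isomorphic symbol mapping."""
    return "".join(SYMBOLS[d] for d in str(n))

def make_probe(max_bits: int = 64) -> dict:
    """Sample one addition instance plus its commuted and symbolic variants."""
    a = random.randrange(2 ** max_bits)
    b = random.randrange(2 ** max_bits)
    return {
        "numeric": f"{a} + {b} = ?",
        "commuted": f"{b} + {a} = ?",  # must yield the same answer if A+B = B+A holds
        "symbolic": f"{to_symbolic(a)} + {to_symbolic(b)} = ?",  # same rule, new surface form
        "answer": a + b,
    }

if __name__ == "__main__":
    probe = make_probe()
    for name, text in probe.items():
        print(f"{name}: {text}")
```

A model that has learned the addition rule should answer the numeric, commuted, and symbolic variants consistently; the large accuracy gap reported above (73.8–99.8% numeric vs. ≤ 7.5% symbolic) is what suggests pattern memorization instead.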