On Memorization of Large Language Models in Logical Reasoning
October 30, 2024
Authors: Chulin Xie, Yangsibo Huang, Chiyuan Zhang, Da Yu, Xinyun Chen, Bill Yuchen Lin, Bo Li, Badih Ghazi, Ravi Kumar
cs.AI
Abstract
Large language models (LLMs) achieve good performance on challenging
reasoning benchmarks, yet could also make basic reasoning mistakes. This
contrasting behavior is puzzling when it comes to understanding the mechanisms
behind LLMs' reasoning capabilities. One hypothesis is that the increasingly
high and nearly saturated performance on common reasoning benchmarks could be
due to the memorization of similar problems. In this paper, we systematically
investigate this hypothesis with a quantitative measurement of memorization in
reasoning tasks, using a dynamically generated logical reasoning benchmark
based on Knights and Knaves (K&K) puzzles. We find that LLMs can interpolate
the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet
fail when those puzzles are slightly perturbed, suggesting that the models
heavily rely on memorization to solve those training puzzles. On the other
hand, we show that while fine-tuning leads to heavy memorization, it also
consistently improves generalization performance. In-depth analyses with
perturbation tests, cross-difficulty-level transferability, probing model
internals, and fine-tuning with wrong answers suggest that the LLMs learn to
reason on K&K puzzles despite training data memorization. This phenomenon
indicates that LLMs exhibit a complex interplay between memorization and
genuine reasoning abilities. Finally, our analysis with per-sample memorization
score sheds light on how LLMs switch between reasoning and memorization in
solving logical puzzles. Our code and data are available at
https://memkklogic.github.io.
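
To ground the setup, below is a minimal sketch of a brute-force checker for K&K-style puzzles of the kind the benchmark builds on, together with one plausible per-sample memorization score based on failures under local perturbation. The `solve` helper, the example statements, and the `memorization_score` definition are illustrative assumptions, not the paper's released code or its exact metric.

```python
from itertools import product

# In a Knights-and-Knaves puzzle, each person is a knight (always truthful)
# or a knave (always lying). Each statement is a predicate over the full
# role assignment (True = knight); the answer is the set of assignments
# under which every knight's statement is true and every knave's is false.

def solve(names, statements):
    """Enumerate all role assignments consistent with the speakers' statements."""
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        a = dict(zip(names, roles))
        if all(a[speaker] == pred(a) for speaker, pred in statements.items()):
            solutions.append(a)
    return solutions

# Example 2-person puzzle: A says "B is a knave"; B says "A and I are the same kind".
names = ["A", "B"]
statements = {
    "A": lambda a: not a["B"],
    "B": lambda a: a["A"] == a["B"],
}
print(solve(names, statements))  # [{'A': True, 'B': False}] -- unique solution

# Hypothetical per-sample memorization score (an assumption, not necessarily
# the paper's metric): a correctly solved training puzzle counts as memorized
# to the extent the model fails its locally perturbed variants.
def memorization_score(correct_on_original, correct_on_perturbed):
    if not correct_on_original or not correct_on_perturbed:
        return 0.0
    return sum(not c for c in correct_on_perturbed) / len(correct_on_perturbed)
```

Because the puzzles are generated and solved programmatically, perturbed variants (e.g., negating one statement or swapping roles) can be produced with ground-truth answers at no labeling cost, which is what makes the perturbation-based memorization analysis feasible at scale.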