Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
April 1, 2025
作者: Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen
cs.AI
Abstract
The rapid escalation in difficulty of LLM benchmarks in recent years, from
elementary school-level to frontier problems, has woven a miraculous vision for
researchers that we are only inches away from surpassing human intelligence.
However, does the LLMs' remarkable reasoning ability indeed come from true
intelligence by human standards, or are they simply reciting solutions
witnessed during training at an Internet scale? To study this problem, we
propose RoR-Bench, a novel multi-modal benchmark for detecting LLMs'
recitation behavior when asked simple reasoning problems with subtly shifted
conditions, and conduct empirical analysis on our benchmark. Surprisingly,
we found that existing cutting-edge LLMs unanimously exhibit extremely severe
recitation behavior; by changing one phrase in the condition, top models such
as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary
school-level arithmetic and reasoning problems. Such findings are a wake-up
call to the LLM community, compelling us to re-evaluate the true intelligence
level of cutting-edge LLMs.
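The evaluation idea the abstract describes can be sketched in a few lines. This is a minimal illustration with hypothetical data and names, not the authors' actual RoR-Bench harness: score a model on each original problem and on its condition-shifted variant, then report the accuracy drop between the two.

```python
# Hypothetical sketch of a perturbation-based recitation check:
# compare accuracy on original problems vs. condition-shifted variants.

def accuracy(results):
    """Fraction of problems answered correctly."""
    return sum(results) / len(results)

# Each pair records (correct on original?, correct on shifted variant?).
# These toy outcomes stand in for real model judgments.
pairs = [
    (True, False),
    (True, True),
    (True, False),
    (True, False),
]

orig_acc = accuracy([orig for orig, _ in pairs])
shift_acc = accuracy([shifted for _, shifted in pairs])
drop = orig_acc - shift_acc

# A large drop suggests the model recited a memorized solution
# rather than reasoning from the (changed) conditions.
print(f"original: {orig_acc:.0%}, shifted: {shift_acc:.0%}, drop: {drop:.0%}")
```

In the paper's setting, the reported figure is exactly this kind of gap: top models losing around 60% accuracy once a single phrase in the problem's conditions is changed.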