Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
April 1, 2025
作者: Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen
cs.AI
Abstract
The rapid escalation in difficulty of LLM benchmarks in recent years, from
elementary school-level to frontier problems, has woven a miraculous vision for
researchers that we are only inches away from surpassing human intelligence.
However, does the LLMs' remarkable reasoning ability indeed come from true
intelligence by human standards, or are they simply reciting solutions
witnessed during training at an Internet scale? To study this problem, we
propose RoR-Bench, a novel multi-modal benchmark for detecting LLMs'
recitation behavior when asked simple reasoning problems with subtly shifted
conditions, and conduct empirical analysis on our benchmark. Surprisingly,
we found that existing cutting-edge LLMs unanimously exhibit extremely severe
recitation behavior; by changing one phrase in the condition, top models such
as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary
school-level arithmetic and reasoning problems. Such findings are a wake-up
call to the LLM community, compelling us to re-evaluate the true intelligence
level of cutting-edge LLMs.
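The evaluation idea the abstract describes can be sketched in a few lines. This is a minimal illustration with hypothetical data and names, not the authors' actual RoR-Bench harness: score a model on each original problem and on its condition-shifted variant, then report the accuracy drop between the two.

```python
# Hypothetical sketch of a perturbation-based recitation check:
# compare accuracy on original problems vs. condition-shifted variants.

def accuracy(results):
    """Fraction of problems answered correctly."""
    return sum(results) / len(results)

# Each pair records (correct on original?, correct on shifted variant?).
# These toy outcomes stand in for real model judgments.
pairs = [
    (True, False),
    (True, True),
    (True, False),
    (True, False),
]

orig_acc = accuracy([orig for orig, _ in pairs])
shift_acc = accuracy([shifted for _, shifted in pairs])
drop = orig_acc - shift_acc

# A large drop suggests the model recited a memorized solution
# rather than reasoning from the (changed) conditions.
print(f"original: {orig_acc:.0%}, shifted: {shift_acc:.0%}, drop: {drop:.0%}")
```

In the paper's setting, the reported figure is exactly this kind of gap: top models losing around 60% accuracy once a single phrase in the problem's conditions is changed.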