

Large Language Models and Mathematical Reasoning Failures

February 17, 2025
Authors: Johan Boye, Birger Moell
cs.AI

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models - including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants - we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
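To make the evaluation idea concrete, the sketch below separates the two signals the abstract distinguishes: whether the final answer matches the gold answer, and whether the intermediate solution steps hold up. This is a minimal illustrative sketch, not the authors' actual evaluation code; the Problem and Grade structures and the step_checker callback (which could be a human annotator or an automated verifier) are hypothetical.

```python
# Hypothetical two-level grading: score the final answer and the
# reasoning steps independently, so "right answer via flawed logic"
# cases are surfaced instead of counted as successes.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    question: str
    expected_answer: str  # gold final answer

@dataclass
class Grade:
    answer_correct: bool  # final answer matches the gold answer
    steps_valid: bool     # every intermediate step checks out

def grade_response(
    problem: Problem,
    final_answer: str,
    steps: list[str],
    step_checker: Callable[[Problem, str], bool],
) -> Grade:
    answer_ok = final_answer.strip() == problem.expected_answer.strip()
    steps_ok = all(step_checker(problem, step) for step in steps)
    return Grade(answer_correct=answer_ok, steps_valid=steps_ok)
```

Under this scheme, a response counts as fully correct only when both fields are true; the combination answer_correct and not steps_valid flags exactly the failure mode the paper warns about.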

