Large Language Models and Mathematical Reasoning Failures

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both the final answers and the solution steps to identify reasoning failures. Evaluating eight state-of-the-art models, including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants, we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models make errors in spatial reasoning, strategic planning, and arithmetic, and sometimes reach correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that, despite possessing broad mathematical knowledge, models struggle with problems that require multi-step deduction or real-world knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities and the need for targeted improvements in structured reasoning and constraint handling.

