Large Language Models and Mathematical Reasoning Failures

Abstract

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze both final answers and solution steps to identify reasoning failures. Evaluating eight state-of-the-art models, including Mixtral, Llama, Gemini, GPT-4o, and OpenAI's o1 variants, we find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic, sometimes producing correct answers through flawed logic. Common failure modes include unwarranted assumptions, over-reliance on numerical patterns, and difficulty translating physical intuition into mathematical steps. Manual analysis reveals that models struggle with problems requiring multi-step deduction or real-world knowledge, despite possessing broad mathematical knowledge. Our results underscore the importance of evaluating reasoning processes, not just answers, and caution against overestimating LLMs' problem-solving proficiency. The study highlights persistent gaps in LLMs' generalization abilities, emphasizing the need for targeted improvements in structured reasoning and constraint handling.
