FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
February 27, 2025
Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
cs.AI
Abstract
Many challenging reasoning tasks require not just rapid, intuitive responses,
but a more deliberate, multi-step approach. Recent progress in large language
models (LLMs) highlights an important shift from the "System 1" way of quick
reactions to the "System 2" style of reflection-and-correction problem solving.
However, current benchmarks rely heavily on final-answer accuracy, leaving
most of a model's intermediate reasoning steps unexamined. This fails to assess
a model's ability to reflect on and rectify mistakes within the reasoning
process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark
for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be
decomposed into atomic steps, making it ideal for rigorous validation of
intermediate correctness. Building on this, we introduce two tasks, state
checking and state transition, for a comprehensive evaluation of how models
assess the current situation and plan the next move. To support broader
research, we also provide a puzzle training set aimed at enhancing performance
on general mathematical tasks. We show that models trained on our
state-checking and state-transition data achieve gains of up to 5.1% in math
reasoning on GSM8K.
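
To make the two tasks concrete, below is a minimal sketch of how state checking and state transition could be posed on a 4x4 Sudoku-style puzzle, where each atomic step fills one cell. The puzzle type, state encoding, and function names (candidates, state_check, state_transitions) are illustrative assumptions for this sketch, not the benchmark's actual formats.

```python
from copy import deepcopy

Grid = list[list[int]]  # 0 marks an empty cell

def candidates(grid: Grid, r: int, c: int) -> set[int]:
    """Digits that can legally fill cell (r, c) in a 4x4 Sudoku."""
    used = set(grid[r]) | {grid[i][c] for i in range(4)}
    br, bc = 2 * (r // 2), 2 * (c // 2)  # top-left corner of the 2x2 box
    used |= {grid[br + i][bc + j] for i in range(2) for j in range(2)}
    return set(range(1, 5)) - used

def state_check(grid: Grid) -> bool:
    """State checking: can this partial state still reach a full
    solution? Verified here by exhaustive backtracking search."""
    for r in range(4):
        for c in range(4):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    solvable = state_check(grid)
                    grid[r][c] = 0
                    if solvable:
                        return True
                return False  # dead end: no digit fits this cell
    return True  # no empty cells left: the state is a solution

def state_transitions(grid: Grid) -> list[Grid]:
    """State transition: enumerate all valid next states obtained by
    filling the first empty cell with each legal digit."""
    for r in range(4):
        for c in range(4):
            if grid[r][c] == 0:
                out = []
                for d in candidates(grid, r, c):
                    nxt = deepcopy(grid)
                    nxt[r][c] = d
                    out.append(nxt)
                return out
    return []  # already solved: no moves remain

puzzle = [[1, 0, 0, 4],
          [0, 0, 1, 0],
          [0, 1, 0, 0],
          [4, 0, 0, 1]]
print(state_check(puzzle))             # ground-truth solvability label
print(len(state_transitions(puzzle)))  # number of legal next states
```

Because both queries can be answered exactly by search, a state-checking question ("can this state still lead to a solution?") and a state-transition question ("what is a valid next state?") each come with a verifiable ground truth, which is what enables step-level rather than final-answer-only evaluation.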