FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
February 27, 2025
Authors: Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
cs.AI
Abstract
Many challenging reasoning tasks require not just rapid, intuitive responses,
but a more deliberate, multi-step approach. Recent progress in large language
models (LLMs) highlights an important shift from the "System 1" way of quick
reactions to the "System 2" style of reflection-and-correction problem solving.
However, current benchmarks rely heavily on final-answer accuracy, leaving
most of a model's intermediate reasoning steps unexamined. This fails to assess
a model's ability to reflect on and rectify mistakes within the reasoning
process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark
for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be
decomposed into atomic steps, making it ideal for rigorous validation of
intermediate correctness. Building on this, we introduce two tasks, state
checking and state transition, for a comprehensive evaluation of how models
assess the current situation and plan the next move. To support broader
research, we also provide a puzzle training set aimed at enhancing performance
on general mathematical tasks. We show that models trained on our
state-checking and state-transition data achieve gains of up to 5.1% in math
reasoning on GSM8K.
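
To make the two tasks concrete, below is a minimal sketch of how state checking and state transition could be posed on a 4x4 Sudoku-style puzzle, where each atomic step fills one cell. The puzzle type, state encoding, and function names (candidates, state_check, state_transitions) are illustrative assumptions for this sketch, not the benchmark's actual formats.

```python
from copy import deepcopy

Grid = list[list[int]]  # 0 marks an empty cell

def candidates(grid: Grid, r: int, c: int) -> set[int]:
    """Digits that can legally fill cell (r, c) in a 4x4 Sudoku."""
    used = set(grid[r]) | {grid[i][c] for i in range(4)}
    br, bc = 2 * (r // 2), 2 * (c // 2)  # top-left corner of the 2x2 box
    used |= {grid[br + i][bc + j] for i in range(2) for j in range(2)}
    return set(range(1, 5)) - used

def state_check(grid: Grid) -> bool:
    """State checking: can this partial state still reach a full
    solution? Verified here by exhaustive backtracking search."""
    for r in range(4):
        for c in range(4):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    solvable = state_check(grid)
                    grid[r][c] = 0
                    if solvable:
                        return True
                return False  # dead end: no digit fits this cell
    return True  # no empty cells left: the state is a solution

def state_transitions(grid: Grid) -> list[Grid]:
    """State transition: enumerate all valid next states obtained by
    filling the first empty cell with each legal digit."""
    for r in range(4):
        for c in range(4):
            if grid[r][c] == 0:
                out = []
                for d in candidates(grid, r, c):
                    nxt = deepcopy(grid)
                    nxt[r][c] = d
                    out.append(nxt)
                return out
    return []  # already solved: no moves remain

puzzle = [[1, 0, 0, 4],
          [0, 0, 1, 0],
          [0, 1, 0, 0],
          [4, 0, 0, 1]]
print(state_check(puzzle))             # ground-truth solvability label
print(len(state_transitions(puzzle)))  # number of legal next states
```

Because both queries can be answered exactly by search, a state-checking question ("can this state still lead to a solution?") and a state-transition question ("what is a valid next state?") each come with a verifiable ground truth, which is what enables step-level rather than final-answer-only evaluation.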