S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
April 14, 2025
Authors: Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu
cs.AI
Abstract
We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning
Models' (LRMs) performance on simple tasks that favor intuitive system 1
thinking rather than deliberative system 2 reasoning. While LRMs have achieved
significant breakthroughs in complex reasoning tasks through explicit chains of
thought, their reliance on deep analytical thinking may limit their system 1
thinking capabilities. Moreover, no benchmark currently exists to
evaluate LRMs' performance in tasks that require such capabilities. To fill
this gap, S1-Bench presents a set of simple, diverse, and naturally clear
questions across multiple domains and languages, specifically designed to
assess LRMs' performance in such tasks. Our comprehensive evaluation of 22 LRMs
reveals significantly lower efficiency, with outputs averaging 15.5
times longer than those of traditional small LLMs. Additionally, LRMs often
identify correct answers early but continue unnecessary deliberation, with some
models even producing numerous errors. These findings highlight the rigid
reasoning patterns of current LRMs and underscore the substantial development
needed to achieve balanced dual-system thinking capabilities that can adapt
appropriately to task complexity.