S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
April 14, 2025
Authors: Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu
cs.AI
Abstract
We introduce S1-Bench, a novel benchmark designed to evaluate Large Reasoning
Models' (LRMs) performance on simple tasks that favor intuitive system 1
thinking rather than deliberative system 2 reasoning. While LRMs have achieved
significant breakthroughs in complex reasoning tasks through explicit chains of
thought, their reliance on deep analytical thinking may limit their system 1
thinking capabilities. Moreover, no benchmark currently exists to evaluate
LRMs' performance on tasks that require such capabilities. To fill
this gap, S1-Bench presents a set of simple, diverse, and naturally clear
questions across multiple domains and languages, specifically designed to
assess LRMs' performance in such tasks. Our comprehensive evaluation of 22 LRMs
reveals a pronounced tendency toward inefficiency, with outputs averaging 15.5
times longer than those of traditional small LLMs. Additionally, LRMs often
identify correct answers early but continue unnecessary deliberation, with some
models even producing numerous errors. These findings highlight the rigid
reasoning patterns of current LRMs and underscore the substantial development
needed to achieve balanced dual-system thinking capabilities that can adapt
appropriately to task complexity.
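
The abstract's headline finding is a length comparison: on simple questions, LRM outputs average 15.5 times longer than those of small conventional LLMs. As a rough illustration of that kind of metric (a minimal sketch, not the authors' evaluation code), the Python snippet below computes an average output-length ratio from hypothetical model responses; the whitespace-token counting, the example answers, and the function name are all assumptions for illustration.

from statistics import mean

def output_length_ratio(lrm_outputs, small_llm_outputs):
    """Ratio of mean output lengths, counted here in whitespace-split tokens."""
    lrm_mean = mean(len(text.split()) for text in lrm_outputs)
    small_mean = mean(len(text.split()) for text in small_llm_outputs)
    return lrm_mean / small_mean

if __name__ == "__main__":
    # Toy responses to the same simple question, for illustration only.
    lrm_answers = [
        "Okay, the question asks for 2 + 2. Let me think step by step. "
        "2 plus 2 is 4. Wait, let me double-check that: yes, the answer is 4."
    ]
    small_llm_answers = ["2 + 2 = 4."]
    ratio = output_length_ratio(lrm_answers, small_llm_answers)
    print(f"LRM outputs are {ratio:.1f}x longer on average.")

In the paper's setting, a comparison of this kind is aggregated over the full benchmark and across the 22 evaluated LRMs; the exact tokenization and aggregation used to obtain the 15.5x figure are not specified in the abstract.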