THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
April 17, 2025
Authors: Xiao Pu, Michael Saxon, Wenyue Hua, William Yang Wang
cs.AI
Abstract
Reasoning models have demonstrated impressive performance on difficult tasks
that traditional language models struggle with. However, many are plagued by
the problem of overthinking: generating large amounts of unnecessary tokens
that do not improve accuracy on a question. We introduce approximate measures
of problem-level difficulty, demonstrate that a clear relationship exists
between problem difficulty and optimal token spend, and evaluate how well
calibrated a variety of reasoning models are at efficiently allocating the
optimal token count. We find that, in general, reasoning models are poorly
calibrated, particularly on easy problems. To evaluate calibration on easy
questions we introduce DUMB500, a dataset of extremely easy math, reasoning,
code, and task problems, and jointly evaluate reasoning models on these simple
examples and extremely difficult examples from existing frontier benchmarks on
the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free
black box decoding technique that significantly improves reasoning model
calibration.
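The abstract's notions of problem difficulty, optimal token spend, and overthinking can be made concrete with simple proxies. The sketch below is illustrative only: the function names, the failure-rate proxy for difficulty, and the "fewest tokens among correct attempts" proxy for optimal spend are assumptions for exposition, not the paper's exact definitions. Each problem is represented as a list of `(token_count, is_correct)` pairs from sampled model attempts.

```python
from statistics import mean

def difficulty(correct_flags):
    """Approximate problem difficulty as the failure rate across
    sampled attempts (a hypothetical proxy)."""
    return 1.0 - mean(correct_flags)

def optimal_spend(samples):
    """Optimal token spend: the fewest tokens used by any correct
    attempt; None if no attempt was correct."""
    correct = [tokens for tokens, ok in samples if ok]
    return min(correct) if correct else None

def overthinking_score(samples):
    """Mean excess tokens over the optimal spend, averaged across
    correct attempts; higher means worse calibration."""
    opt = optimal_spend(samples)
    if opt is None:
        return None
    return mean(tokens - opt for tokens, ok in samples if ok)

# Example: one easy problem sampled three times.
attempts = [(120, True), (40, True), (300, False)]
```

Under these proxies, a well-calibrated model would score near zero on easy problems (every correct attempt close to the minimal spend), while an overthinking model accumulates large excess-token averages precisely on the easy end, matching the paper's finding that miscalibration is worst on easy questions.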