THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
April 17, 2025
Authors: Xiao Pu, Michael Saxon, Wenyue Hua, William Yang Wang
cs.AI
Abstract
Reasoning models have demonstrated impressive performance on difficult tasks
that traditional language models struggle with. However, many are plagued with
the problem of overthinking--generating large amounts of unnecessary tokens
which don't improve accuracy on a question. We introduce approximate measures
of problem-level difficulty and demonstrate that a clear relationship between
problem difficulty and optimal token spend exists, and evaluate how well
calibrated a variety of reasoning models are in terms of efficiently allocating
the optimal token count. We find that in general, reasoning models are poorly
calibrated, particularly on easy problems. To evaluate calibration on easy
questions we introduce DUMB500, a dataset of extremely easy math, reasoning,
code, and task problems, and jointly evaluate reasoning models on these simple
examples and extremely difficult examples from existing frontier benchmarks on
the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free
black box decoding technique that significantly improves reasoning model
calibration.
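The abstract describes THOUGHTTERMINATOR only at a high level. As a hedged illustration of the general idea, not the paper's actual algorithm, the sketch below shows difficulty-calibrated token budgeting: estimate a question's difficulty from repeated solve attempts, map that difficulty to a decoding token budget, and terminate decoding once the budget is spent. The helpers `sample_solve` and `generate_step` are hypothetical stand-ins for model calls.

```python
def estimate_difficulty(question, sample_solve, n_samples=10):
    """Approximate difficulty as the failure rate over repeated solve attempts.

    `sample_solve` is a hypothetical callable returning True if the model
    answered the question correctly on one attempt.
    """
    failures = sum(not sample_solve(question) for _ in range(n_samples))
    return failures / n_samples  # 0.0 = trivial, 1.0 = never solved


def token_budget(difficulty, min_tokens=64, max_tokens=4096):
    """Map a difficulty score in [0, 1] to a token budget.

    Easy questions get a small budget; hard questions get a large one.
    """
    return int(min_tokens + difficulty * (max_tokens - min_tokens))


def generate_with_budget(question, generate_step, budget):
    """Decode token by token, terminating once the budget is exhausted.

    `generate_step` is a hypothetical callable producing the next token,
    or None when the model emits end-of-sequence on its own.
    """
    tokens = []
    while len(tokens) < budget:
        tok = generate_step(question, tokens)
        if tok is None:  # model finished before hitting the budget
            break
        tokens.append(tok)
    return tokens
```

In a black-box setting like the one the abstract describes, the budget cap could equivalently be enforced through a decoding API's maximum-token parameter rather than a manual loop.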