THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
April 17, 2025
Authors: Xiao Pu, Michael Saxon, Wenyue Hua, William Yang Wang
cs.AI
Abstract
Reasoning models have demonstrated impressive performance on difficult tasks
that traditional language models struggle with. However, many are plagued by
the problem of overthinking: generating large amounts of unnecessary tokens
that do not improve accuracy on a question. We introduce approximate measures
of problem-level difficulty, demonstrate that a clear relationship exists
between problem difficulty and optimal token spend, and evaluate how well
calibrated a variety of reasoning models are at efficiently allocating the
optimal token count. We find that, in general, reasoning models are poorly
calibrated, particularly on easy problems. To evaluate calibration on easy
questions we introduce DUMB500, a dataset of extremely easy math, reasoning,
code, and task problems, and jointly evaluate reasoning models on these simple
examples and extremely difficult examples from existing frontier benchmarks on
the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free
black box decoding technique that significantly improves reasoning model
calibration.
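The abstract's notions of problem difficulty, optimal token spend, and overthinking can be made concrete with simple proxies. The sketch below is illustrative only: the function names, the failure-rate proxy for difficulty, and the "fewest tokens among correct attempts" proxy for optimal spend are assumptions for exposition, not the paper's exact definitions. Each problem is represented as a list of `(token_count, is_correct)` pairs from sampled model attempts.

```python
from statistics import mean

def difficulty(correct_flags):
    """Approximate problem difficulty as the failure rate across
    sampled attempts (a hypothetical proxy)."""
    return 1.0 - mean(correct_flags)

def optimal_spend(samples):
    """Optimal token spend: the fewest tokens used by any correct
    attempt; None if no attempt was correct."""
    correct = [tokens for tokens, ok in samples if ok]
    return min(correct) if correct else None

def overthinking_score(samples):
    """Mean excess tokens over the optimal spend, averaged across
    correct attempts; higher means worse calibration."""
    opt = optimal_spend(samples)
    if opt is None:
        return None
    return mean(tokens - opt for tokens, ok in samples if ok)

# Example: one easy problem sampled three times.
attempts = [(120, True), (40, True), (300, False)]
```

Under these proxies, a well-calibrated model would score near zero on easy problems (every correct attempt close to the minimal spend), while an overthinking model accumulates large excess-token averages precisely on the easy end, matching the paper's finding that miscalibration is worst on easy questions.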