
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

December 30, 2024
作者: Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
cs.AI

Abstract

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated to simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of test sets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
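As a rough illustration of what an outcome-oriented efficiency metric could look like, the sketch below scores a response by the fraction of generated tokens spent up to the first solution round that already reaches the correct answer. The function name, inputs, and exact definition are assumptions for illustration, not the paper's formulation.

```python
def outcome_efficiency(solution_token_counts, solution_is_correct):
    """Hypothetical outcome-efficiency score for a multi-round response.

    solution_token_counts: tokens used by each sequential solution attempt.
    solution_is_correct: whether each attempt reaches the correct answer.
    Returns tokens up to and including the first correct attempt, divided
    by the total tokens generated (1.0 means no wasted reflection rounds).
    """
    total = sum(solution_token_counts)
    if total == 0:
        return 0.0
    used = 0
    for count, correct in zip(solution_token_counts, solution_is_correct):
        used += count
        if correct:
            return used / total
    return 0.0  # never reaches a correct answer: no useful outcome


# Example of overthinking: the first round (40 tokens) is already correct,
# but the model keeps double-checking for another 160 tokens.
print(outcome_efficiency([40, 90, 70], [True, True, True]))  # -> 0.2
```

Under this toy definition, a model that answers "2+3=?" correctly in its first round and then stops would score 1.0, while repeated redundant verification rounds drive the score toward 0.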

