Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Abstract

The remarkable performance of models like OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance their problem-solving capabilities. However, a critical question remains: how to scale computational resources intelligently and efficiently at test time. This paper presents the first comprehensive study of the prevalent issue of overthinking in these models, where excessive computational resources are allocated to simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate whether o1-like models use computational resources rationally. Using a self-training paradigm, we propose strategies that mitigate overthinking and streamline the reasoning process without compromising accuracy. Experimental results show that our approach reduces computational overhead while preserving model performance across test sets of varying difficulty, such as GSM8K, MATH500, GPQA, and AIME.
