Reasoning Models Can Be Effective Without Thinking

April 14, 2025
Authors: Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, Matei Zaharia
cs.AI

Abstract

Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on AMC 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.
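
The abstract describes two pieces: a NoThinking prompt that skips the explicit reasoning trace, and a parallel-scaling recipe that samples N NoThinking outputs and aggregates them. The sketch below illustrates how these could be wired together; it assumes a DeepSeek-R1-style <think>...</think> chat template, vLLM for sampling, mean per-token log-probability as the confidence signal, and a boxed-answer parser, all of which are illustrative assumptions rather than the paper's released implementation.

```python
# A minimal sketch of the two ideas in the abstract, not the authors' code.
# Assumptions (for illustration): the model wraps its reasoning in
# <think>...</think> tags (DeepSeek-R1-style), vLLM is used for sampling, and
# "confidence" is approximated by the mean per-token log-probability.
import math
import re

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"


def nothinking_prompt(tokenizer, question: str) -> str:
    """Bypass explicit reasoning by pre-filling an already-closed thinking block."""
    chat = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # The pre-filled phrase is an illustrative choice; the model then proceeds
    # directly to the final solution instead of generating a long thinking trace.
    prefill = "<think>\nOkay, I think I have finished thinking.\n</think>\n\n"
    if chat.rstrip().endswith("<think>"):
        # Some chat templates already open the think block for the assistant turn.
        prefill = prefill[len("<think>\n"):]
    return chat + prefill


def extract_final_answer(text: str) -> str:
    """Rough, task-specific parsing stub: last \\boxed{...} or last non-empty line."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1]
    lines = [line for line in text.strip().splitlines() if line.strip()]
    return lines[-1] if lines else ""


def parallel_nothinking(question: str, n: int = 8, budget: int = 700) -> str:
    """Parallel scaling: draw n independent NoThinking samples under a token
    budget, then aggregate with a confidence-weighted vote (a simple best-of-N
    strategy standing in for a task-specific verifier)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    llm = LLM(model=MODEL)  # in practice, build the engine once and reuse it
    params = SamplingParams(n=n, temperature=0.6, top_p=0.95, max_tokens=budget)

    prompt = nothinking_prompt(tokenizer, question)
    completions = llm.generate([prompt], params)[0].outputs

    votes: dict[str, float] = {}
    for c in completions:
        answer = extract_final_answer(c.text)
        mean_logprob = c.cumulative_logprob / max(len(c.token_ids), 1)
        # exp(mean log-prob) gives a positive per-sample weight in (0, 1].
        votes[answer] = votes.get(answer, 0.0) + math.exp(mean_logprob)
    return max(votes, key=votes.get)
```

Because the N samples are drawn in parallel, wall-clock latency stays close to that of a single NoThinking generation, which is the setting in which the abstract compares against Thinking baselines with up to 9x longer latency.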
