

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning

April 10, 2025
Authors: Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali
cs.AI

Abstract

Recent advances in inference-time compute have significantly improved performance on complex tasks by generating long chains of thought (CoTs) using Large Reasoning Models (LRMs). However, this improved accuracy comes at the cost of high inference latency due to the length of generated reasoning sequences and the autoregressive nature of decoding. Our key insight in tackling these overheads is that LRM inference, and the reasoning that it embeds, is highly tolerant of approximations: complex tasks are typically broken down into simpler steps, each of which brings utility based on the semantic insight it provides for downstream steps rather than the exact tokens it generates. Accordingly, we introduce SpecReason, a system that automatically accelerates LRM inference by using a lightweight model to (speculatively) carry out simpler intermediate reasoning steps and reserving the costly base model only to assess (and potentially correct) the speculated outputs. Importantly, SpecReason's focus on exploiting the semantic flexibility of thinking tokens in preserving final-answer accuracy is complementary to prior speculation techniques, most notably speculative decoding, which demands token-level equivalence at each step. Across a variety of reasoning benchmarks, SpecReason achieves a 1.5-2.5× speedup over vanilla LRM inference while improving accuracy by 1.0-9.9%. Compared to speculative decoding without SpecReason, their combination yields an additional 19.4-44.2% latency reduction. We open-source SpecReason at https://github.com/ruipeterpan/specreason.
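To make the speculate-then-verify loop in the abstract concrete, here is a minimal Python sketch of the idea: a lightweight model drafts each intermediate reasoning step, and the base model only scores the draft, regenerating the step itself on rejection. The model interfaces (small_model, base_model, next_step, assess) and the acceptance threshold are hypothetical illustrations for this sketch, not SpecReason's actual API; see the repository above for the real implementation.

```python
def speculative_reasoning(problem, small_model, base_model,
                          accept_threshold=7, max_steps=64):
    """Build a chain of thought step by step: the lightweight model
    proposes each step; the costly base model only judges it (and,
    if needed, regenerates it). Interfaces here are hypothetical."""
    cot = []  # accepted reasoning steps so far
    for _ in range(max_steps):
        # 1. Lightweight model speculates the next reasoning step.
        draft = small_model.next_step(problem, cot)

        # 2. Base model cheaply assesses the draft's semantic utility,
        #    e.g. by decoding a short 0-10 score instead of a full step.
        score = base_model.assess(problem, cot, draft)

        if score >= accept_threshold:
            # 3a. Accept: the draft's exact tokens need not match what
            #     the base model would have produced, only its meaning.
            cot.append(draft)
        else:
            # 3b. Reject: fall back to the base model for this one step.
            cot.append(base_model.next_step(problem, cot))

        if cot[-1].is_final_answer:
            break
    return cot
```

The key design choice, per the abstract, is that verification happens at the level of whole reasoning steps and their semantic usefulness, whereas speculative decoding verifies token by token; this is why the two techniques compose.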

