SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
April 10, 2025
Authors: Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, Ravi Netravali
cs.AI
Abstract
Recent advances in inference-time compute have significantly improved
performance on complex tasks by generating long chains of thought (CoTs) using
Large Reasoning Models (LRMs). However, this improved accuracy comes at the
cost of high inference latency due to the length of generated reasoning
sequences and the autoregressive nature of decoding. Our key insight in
tackling these overheads is that LRM inference, and the reasoning that it
embeds, is highly tolerant of approximations: complex tasks are typically
broken down into simpler steps, each of which brings utility based on the
semantic insight it provides for downstream steps rather than the exact tokens
it generates. Accordingly, we introduce SpecReason, a system that automatically
accelerates LRM inference by using a lightweight model to (speculatively) carry
out simpler intermediate reasoning steps and reserving the costly base model
only to assess (and potentially correct) the speculated outputs. Importantly,
SpecReason's focus on exploiting the semantic flexibility of thinking tokens in
preserving final-answer accuracy is complementary to prior speculation
techniques, most notably speculative decoding, which demands token-level
equivalence at each step. Across a variety of reasoning benchmarks, SpecReason
achieves a 1.5-2.5× speedup over vanilla LRM inference while improving
accuracy by 1.0-9.9%. Compared to speculative decoding without SpecReason,
their combination yields an additional 19.4-44.2% latency reduction. We
open-source SpecReason at https://github.com/ruipeterpan/specreason.
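To make the speculate-then-verify loop described above concrete, here is a minimal Python sketch of how such a system might be structured. Everything in it is an illustrative assumption rather than the released implementation: the model interfaces (generate_step, score_step, generate_answer), the numeric judgment scale, the acceptance threshold, and the </think> end-of-reasoning marker are all hypothetical; see the linked repository for the authors' actual code.

```python
# Hedged sketch of speculative reasoning: a lightweight model drafts each
# intermediate reasoning step, and the costly base model only assesses the
# draft, regenerating the step itself when the draft is judged inadequate.
# All model methods below are assumed interfaces, not the authors' API.

def spec_reason(prompt, small_model, base_model,
                accept_threshold=7, max_steps=256):
    """Return a final answer produced via speculative reasoning."""
    cot = []  # accepted chain-of-thought steps so far
    for _ in range(max_steps):
        context = prompt + "".join(cot)

        # 1. Lightweight model speculatively proposes the next step.
        draft = small_model.generate_step(context)

        # 2. Base model grades the draft's semantic usefulness (assumed
        #    0-9 scale); a cheap, short judgment rather than full decoding.
        score = base_model.score_step(context, draft)

        # 3. Accept the cheap draft, or fall back to the base model for
        #    this step; the expensive path runs only when needed.
        if score >= accept_threshold:
            step = draft
        else:
            step = base_model.generate_step(context)

        cot.append(step)
        if "</think>" in step:  # assumed end-of-reasoning marker
            break

    # The base model composes the final answer from the accepted CoT.
    return base_model.generate_answer(prompt + "".join(cot))
```

The point the sketch captures is that acceptance hinges on semantic adequacy of each step rather than token-level equivalence, which is why this approach composes with, rather than replaces, speculative decoding.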