Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
March 26, 2025
Author: Yijiong Yu
cs.AI
Abstract
Recent reasoning models achieve significantly higher accuracy on complex tasks
such as mathematical reasoning by producing detailed and comprehensive
reasoning processes. However, generating these lengthy reasoning sequences is
computationally expensive and time-consuming. To address this inefficiency, we
leverage the inherent parallelizability of certain tasks to accelerate the
reasoning process. Specifically, when multiple parallel reasoning branches
exist, we decode multiple tokens per step using a specialized attention mask
and process them within a single sequence, which avoids additional memory
usage. Experimental results show that our method achieves a speedup of over
100% in decoding time while maintaining answer quality.
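The abstract does not spell out the attention mask, so the following is only a minimal sketch of one plausible construction, not the author's implementation. It assumes a shared prefix followed by branch tokens interleaved step by step (one new token per branch per decoding step), and it assumes each branch token may attend to the prefix and to earlier tokens of its own branch only. The function name `build_branch_mask` and its parameters are hypothetical.

```python
def build_branch_mask(prefix_len, num_branches, steps):
    """Build a boolean attention mask for parallel branches in one sequence.

    Assumed layout (illustrative): `prefix_len` shared tokens, then `steps`
    groups of `num_branches` tokens, one token per branch in each group.
    mask[i][j] is True when query position i may attend to key position j.
    """
    # Branch id for every position; None marks the shared prefix.
    branch_ids = [None] * prefix_len
    for _ in range(steps):
        branch_ids.extend(range(num_branches))

    n = len(branch_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: never attend to later positions
            in_prefix = branch_ids[j] is None
            same_branch = branch_ids[j] == branch_ids[i]
            mask[i][j] = in_prefix or same_branch
    return mask, branch_ids


if __name__ == "__main__":
    mask, branch_ids = build_branch_mask(prefix_len=2, num_branches=2, steps=2)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, every decoding step appends `num_branches` tokens at once, so the branches share one sequence and one copy of the prefix rather than being replicated across a batch. How this translates into wall-clock speedup depends on the model and hardware and is not modeled here.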