Learning Adaptive Parallel Reasoning with Language Models

Abstract

Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, which increases latency and exhausts the context window, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computation and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, which optimizes both parent and child inference threads to improve task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000 ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
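To make the spawn() and join() pattern concrete, the following is a minimal sketch under stated assumptions, not the APR implementation. In APR the language model itself decides when to spawn child threads and what each child returns; here a hand-written search stands in for the model. The names `combine`, `serial_search`, and `parent_search` are hypothetical, the Countdown variant assumed requires every input number to be used exactly once, and Python's `ThreadPoolExecutor` plays the role of the thread scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def combine(a, b):
    """Yield (value, expression) for each legal operation on two (value, expression) items."""
    (x, ex), (y, ey) = a, b
    yield x + y, f"({ex} + {ey})"
    yield x * y, f"({ex} * {ey})"
    # Keep intermediate values non-negative by subtracting the smaller from the larger.
    big, small = (a, b) if x >= y else (b, a)
    yield big[0] - small[0], f"({big[1]} - {small[1]})"
    # Only exact integer divisions are allowed.
    if y != 0 and x % y == 0:
        yield x // y, f"({ex} / {ey})"
    if x != 0 and y % x == 0:
        yield y // x, f"({ey} / {ex})"


def serial_search(items, target):
    """Stand-in for a child thread: serial depth-first search over its own sub-problem."""
    if len(items) == 1:
        value, expr = items[0]
        return expr if value == target else None
    for i, j in combinations(range(len(items)), 2):
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        for value, expr in combine(items[i], items[j]):
            found = serial_search(rest + [(value, expr)], target)
            if found is not None:
                return found
    return None


def parent_search(numbers, target, max_children=8):
    """Stand-in for the parent thread: expand one step serially, then spawn and join children."""
    items = [(n, str(n)) for n in numbers]
    if len(items) == 1:
        return serial_search(items, target)
    # Serial phase: the parent enumerates the first combination step itself.
    subproblems = []
    for i, j in combinations(range(len(items)), 2):
        rest = [items[k] for k in range(len(items)) if k not in (i, j)]
        for value, expr in combine(items[i], items[j]):
            subproblems.append(rest + [(value, expr)])
    with ThreadPoolExecutor(max_workers=max_children) as pool:
        # spawn(): each child receives only its own sub-problem as context.
        children = [pool.submit(serial_search, sub, target) for sub in subproblems]
        # join(): the parent sees only each child's short result, not its full search trace.
        for child in children:
            result = child.result()
            if result is not None:
                return result
    return None


if __name__ == "__main__":
    print(parent_search([3, 5, 7, 2], 24))
```

The design point this is meant to illustrate is that the parent's context holds only the sub-problem descriptions and the joined results, while each child's exploration stays in its own context. The fixed one-step expansion used here is exactly the kind of predefined structure that APR, according to the abstract, learns to replace with model-chosen spawn points.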

