

Learning Adaptive Parallel Reasoning with Language Models

April 21, 2025
Authors: Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr
cs.AI

Abstract

Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
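The abstract describes APR's core mechanism only at a high level: the model emits spawn() to launch child inference threads and join() to fold their results back into the parent's context. The Python sketch below is a minimal, hypothetical illustration of that control flow; the model interface (generate_step), the literal "spawn(...)"/"join()" step strings, and the thread-pool orchestration are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of spawn()/join()-style adaptive parallel reasoning.
# The model interface (generate_step) and the literal "spawn(...)"/"join()"
# step strings are hypothetical stand-ins for the paper's actual mechanism.
from concurrent.futures import ThreadPoolExecutor, Future


def run_thread(model, prompt: str, executor: ThreadPoolExecutor) -> str:
    """Decode one reasoning thread; the model may emit spawn()/join() steps."""
    trace: list[str] = []
    children: list[Future] = []
    while True:
        step = model.generate_step(prompt + "".join(trace))  # hypothetical API
        if step.startswith("spawn(") and step.endswith(")"):
            # Parent delegates a sub-problem to a child thread that decodes
            # concurrently, keeping the parent's own context short.
            sub_prompt = step[len("spawn("):-1]
            children.append(executor.submit(run_thread, model, sub_prompt, executor))
        elif step == "join()":
            # Parent waits for all children, folds their results back into
            # its context, then resumes serial decoding.
            trace.extend(future.result() for future in children)
            children.clear()
        elif step == "<eos>":
            return "".join(trace)
        else:
            trace.append(step)


def adaptive_parallel_reasoning(model, question: str, max_threads: int = 8) -> str:
    """Top-level entry: run the parent reasoning thread inside a shared pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        return run_thread(model, question, executor)
```

In the actual system described by the abstract, both parent and child threads are produced by the same language model and are jointly optimized with end-to-end reinforcement learning, so the decision of when to spawn is learned rather than hand-coded.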

