Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

April 8, 2025
Authors: Gleb Rodionov, Roman Garipov, Alina Shutova, George Yakushev, Vage Egiazarian, Anton Sinitsin, Denis Kuznedelev, Dan Alistarh
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine in which multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
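To make the shared-cache idea concrete, below is a minimal, illustrative sketch, not the authors' engine: several "workers" append their keys and values to a single cache and attend over everything written so far, so each worker "sees" the others' partial progress. The names SharedKVCache and worker_step are hypothetical, the round-robin loop stands in for true concurrency, and RoPE position handling and real transformer layers are omitted.

```python
# Toy illustration (assumed, not from the paper): multiple workers share one
# key/value cache; each new token attends over all tokens from all workers.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class SharedKVCache:
    """Concurrently appended key/value cache visible to every worker."""
    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])

    def attend(self, query):
        # Single-head scaled dot-product attention over the whole shared cache.
        scores = self.keys @ query / np.sqrt(query.shape[-1])
        return softmax(scores) @ self.values

def worker_step(rng, cache, state, dim=8):
    # Hypothetical stand-in for one decoding step: derive q/k/v from the
    # worker's state, publish k/v to the shared cache, then read the cache.
    q, k, v = (state + rng.normal(scale=0.1, size=dim) for _ in range(3))
    cache.append(k, v)
    return cache.attend(q)  # sees the other workers' tokens "instantly"

rng = np.random.default_rng(0)
dim = 8
cache = SharedKVCache(dim)
states = [rng.normal(size=dim), rng.normal(size=dim)]  # two parallel workers

for step in range(4):
    for i in range(len(states)):  # round-robin stands in for true concurrency
        states[i] = worker_step(rng, cache, states[i], dim)

print("shared cache entries:", len(cache.keys))  # 8 = 4 steps x 2 workers
```

In the actual system the workers generate concurrently rather than in a round-robin loop, and, per the abstract, RoPE is what lets each worker view the shared tokens from its own position without recomputing the cache.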
