Deliberation in Latent Space via Differentiable Cache Augmentation
December 23, 2024
作者: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam
cs.AI
Abstract
Techniques enabling large language models (LLMs) to "think more" by
generating and attending to intermediate reasoning steps have shown promise in
solving complex problems. However, the standard approaches generate sequences
of discrete tokens immediately before responding, and so they can incur
significant latency costs and be challenging to optimize. In this work, we
demonstrate that a frozen LLM can be augmented with an offline coprocessor that
operates on the model's key-value (kv) cache. This coprocessor augments the
cache with a set of latent embeddings designed to improve the fidelity of
subsequent decoding. We train this coprocessor using the language modeling loss
from the decoder on standard pretraining data, while keeping the decoder itself
frozen. This approach enables the model to learn, in an end-to-end
differentiable fashion, how to distill additional computation into its
kv-cache. Because the decoder remains unchanged, the coprocessor can operate
offline and asynchronously, and the language model can function normally if the
coprocessor is unavailable or if a given cache is deemed not to require extra
computation. We show experimentally that when a cache is augmented, the decoder
achieves lower perplexity on numerous subsequent tokens. Furthermore, even
without any task-specific training, our experiments demonstrate that cache
augmentation consistently reduces perplexity and improves performance across a
range of reasoning-intensive tasks.
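To make the mechanism concrete, here is a minimal NumPy sketch of the core idea, not the paper's implementation: a frozen decoder attends over a kv-cache, and a small "coprocessor" (here an illustrative mean-pool followed by trainable linear maps `W_k`, `W_v`, both assumptions of this sketch) appends latent key/value pairs to that cache before the next decoding step. In the paper, only the coprocessor parameters would receive gradients from the decoder's language-modeling loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # head dimension (illustrative)
n_ctx = 5      # tokens already in the kv-cache
n_latent = 3   # latent embeddings the coprocessor adds

# Frozen decoder's existing kv-cache for the prefix.
k_cache = rng.normal(size=(n_ctx, d))
v_cache = rng.normal(size=(n_ctx, d))

# Hypothetical coprocessor: pool the cache, then map it to extra
# key/value pairs. Only W_k and W_v would be trained end-to-end;
# the decoder itself stays frozen.
W_k = rng.normal(size=(d, n_latent * d)) * 0.1
W_v = rng.normal(size=(d, n_latent * d)) * 0.1
pooled = k_cache.mean(axis=0)                 # crude summary of the cache
k_latent = (pooled @ W_k).reshape(n_latent, d)
v_latent = (pooled @ W_v).reshape(n_latent, d)

# Augmented cache seen by the frozen decoder at the next step.
k_aug = np.concatenate([k_cache, k_latent], axis=0)
v_aug = np.concatenate([v_cache, v_latent], axis=0)

def attend(q, k, v):
    """Single-head scaled dot-product attention over a kv-cache."""
    scores = (k @ q) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v

q = rng.normal(size=d)                        # query for the next token
out_plain = attend(q, k_cache, v_cache)       # decoding without augmentation
out_aug = attend(q, k_aug, v_aug)             # decoding with augmented cache
```

Because the augmentation only concatenates extra entries onto the cache, the decoder runs unchanged when the coprocessor is skipped (`out_plain` above), which is what lets the coprocessor operate offline and asynchronously.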