Deliberation in Latent Space via Differentiable Cache Augmentation

December 23, 2024
作者: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam
cs.AI

Abstract

Techniques enabling large language models (LLMs) to "think more" by generating and attending to intermediate reasoning steps have shown promise in solving complex problems. However, the standard approaches generate sequences of discrete tokens immediately before responding, and so they can incur significant latency costs and be challenging to optimize. In this work, we demonstrate that a frozen LLM can be augmented with an offline coprocessor that operates on the model's key-value (kv) cache. This coprocessor augments the cache with a set of latent embeddings designed to improve the fidelity of subsequent decoding. We train this coprocessor using the language modeling loss from the decoder on standard pretraining data, while keeping the decoder itself frozen. This approach enables the model to learn, in an end-to-end differentiable fashion, how to distill additional computation into its kv-cache. Because the decoder remains unchanged, the coprocessor can operate offline and asynchronously, and the language model can function normally if the coprocessor is unavailable or if a given cache is deemed not to require extra computation. We show experimentally that when a cache is augmented, the decoder achieves lower perplexity on numerous subsequent tokens. Furthermore, even without any task-specific training, our experiments demonstrate that cache augmentation consistently reduces perplexity and improves performance across a range of reasoning-intensive tasks.
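The core mechanism — a trainable coprocessor that appends latent key-value pairs to a frozen decoder's kv-cache — can be illustrated with a toy sketch. This is a minimal illustrative assumption, not the paper's implementation: it uses single-head attention, a linear coprocessor over a mean-pooled cache, and arbitrary small dimensions.

```python
import numpy as np

# Toy sketch of kv-cache augmentation. All names, shapes, and the linear
# coprocessor below are illustrative assumptions, not the paper's design.

rng = np.random.default_rng(0)
D = 8          # hidden size
T = 5          # tokens already in the kv-cache
M = 3          # number of latent embeddings the coprocessor contributes

def attention(q, K, V):
    """One query vector attending over a set of keys/values (softmax attention)."""
    scores = K @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Frozen decoder state: an existing kv-cache for T prior tokens.
K_cache = rng.normal(size=(T, D))
V_cache = rng.normal(size=(T, D))

# Coprocessor: here just a linear map from the mean-pooled cache to M extra
# (key, value) pairs. In the paper this role is played by a learned module
# trained end-to-end through the frozen decoder's language modeling loss.
W_k = rng.normal(size=(D, M * D)) * 0.1
W_v = rng.normal(size=(D, M * D)) * 0.1

pooled = np.concatenate([K_cache, V_cache]).mean(axis=0)
K_latent = (pooled @ W_k).reshape(M, D)
V_latent = (pooled @ W_v).reshape(M, D)

# Augmented cache: latent embeddings appended to the ordinary kv-cache.
K_aug = np.vstack([K_cache, K_latent])
V_aug = np.vstack([V_cache, V_latent])

# The frozen decoder attends over the augmented cache exactly as before; if
# the coprocessor is unavailable, it simply attends over (K_cache, V_cache).
q = rng.normal(size=(D,))
out_plain = attention(q, K_cache, V_cache)  # decoding without augmentation
out_aug = attention(q, K_aug, V_aug)        # decoding with augmented cache
```

Because the decoder's interface is unchanged (it still consumes an ordinary cache, just a longer one), only the coprocessor's parameters would receive gradients, which is what makes the augmentation asynchronous and optional at inference time.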
