Deliberation in Latent Space via Differentiable Cache Augmentation
December 23, 2024
Authors: Luyang Liu, Jonas Pfeiffer, Jiaxing Wu, Jun Xie, Arthur Szlam
cs.AI
Abstract
Techniques enabling large language models (LLMs) to "think more" by
generating and attending to intermediate reasoning steps have shown promise in
solving complex problems. However, the standard approaches generate sequences
of discrete tokens immediately before responding, and so they can incur
significant latency costs and be challenging to optimize. In this work, we
demonstrate that a frozen LLM can be augmented with an offline coprocessor that
operates on the model's key-value (kv) cache. This coprocessor augments the
cache with a set of latent embeddings designed to improve the fidelity of
subsequent decoding. We train this coprocessor using the language modeling loss
from the decoder on standard pretraining data, while keeping the decoder itself
frozen. This approach enables the model to learn, in an end-to-end
differentiable fashion, how to distill additional computation into its
kv-cache. Because the decoder remains unchanged, the coprocessor can operate
offline and asynchronously, and the language model can function normally if the
coprocessor is unavailable or if a given cache is deemed not to require extra
computation. We show experimentally that when a cache is augmented, the decoder
achieves lower perplexity on numerous subsequent tokens. Furthermore, even
without any task-specific training, our experiments demonstrate that cache
augmentation consistently reduces perplexity and improves performance across a
range of reasoning-intensive tasks.
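The core mechanism described in the abstract can be sketched in miniature: a coprocessor reads a frozen decoder's key-value cache, emits extra latent key/value pairs, and appends them to the cache, so the decoder itself is unchanged and still works if the coprocessor never runs. The toy numpy code below is an illustrative assumption, not the paper's actual architecture (the paper trains a transformer coprocessor end-to-end through the frozen decoder's language-modeling loss); the mean-pooling summary, the linear projections, and all names here are invented for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # embedding dimension (toy size, not from the paper)
N_LATENT = 4   # number of latent embeddings appended to the cache

def frozen_decoder_attend(kv_cache, query):
    """Toy stand-in for a frozen decoder step: softmax attention
    over whatever (possibly augmented) kv cache it is handed."""
    keys, values = kv_cache
    scores = keys @ query / np.sqrt(D)          # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                     # (D,)

def coprocessor_augment(kv_cache, params):
    """Hypothetical coprocessor: pool the existing cache, project it
    into N_LATENT extra key/value pairs, and append them. In the paper
    this module is a trained transformer; a linear map stands in here."""
    keys, values = kv_cache
    summary = keys.mean(axis=0)                               # (D,)
    latent_k = (params["k"] @ summary).reshape(N_LATENT, D)
    latent_v = (params["v"] @ summary).reshape(N_LATENT, D)
    return (np.vstack([keys, latent_k]),
            np.vstack([values, latent_v]))

# A 5-token cache and random (untrained) coprocessor weights.
keys = rng.normal(size=(5, D))
values = rng.normal(size=(5, D))
params = {"k": rng.normal(size=(N_LATENT * D, D)),
          "v": rng.normal(size=(N_LATENT * D, D))}

query = rng.normal(size=D)
out_plain = frozen_decoder_attend((keys, values), query)  # no coprocessor: still valid
aug_cache = coprocessor_augment((keys, values), params)
out_aug = frozen_decoder_attend(aug_cache, query)         # same decoder, richer cache
```

Note the asynchrony the abstract emphasizes: `frozen_decoder_attend` is identical in both calls, so the coprocessor can run offline (or be skipped entirely) without touching the decoder; only the cache it attends over changes.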