SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
October 4, 2024
Authors: Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He
cs.AI
Abstract
LLM inference for popular enterprise use cases, such as summarization, RAG,
and code-generation, typically observes orders of magnitude longer prompt
lengths than generation lengths. This characteristic leads to high cost of
prefill and increased response latency. In this paper, we present SwiftKV, a
novel model transformation and distillation procedure specifically designed to
reduce the time and cost of processing prompt tokens while preserving high
quality of generated tokens. SwiftKV combines three key mechanisms: i)
SingleInputKV, which prefills later layers' KV cache using a much earlier
layer's output, allowing prompt tokens to skip much of the model computation,
ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the
memory footprint and support larger batch size for higher throughput, and iii)
a knowledge-preserving distillation procedure that can adapt existing LLMs for
SwiftKV with minimal accuracy impact and low compute and data requirements. For
Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50%
and the memory requirement of the KV cache by 62.5% while incurring minimal
quality degradation across a wide range of tasks. In the end-to-end inference
serving using an optimized vLLM implementation, SwiftKV realizes up to 2x
higher aggregate throughput and 60% lower time per output token. It can achieve
a staggering 560 TFlops/GPU of normalized inference throughput, which
translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100
GPUs.
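As a quick sanity check on that last figure, the numbers are consistent under the usual approximation of roughly 2 FLOPs per parameter per token for dense decoder inference. This is my assumption about how "normalized" throughput is computed here; the abstract does not state it.

```python
# Back-of-the-envelope check of the reported 560 TFlops/GPU figure.
# Assumption: "normalized" throughput counts ~2 FLOPs per parameter per token,
# the standard approximation for dense decoder inference.
params = 70e9                    # Llama-3.1-70B parameter count
flops_per_token = 2 * params     # ~140 GFLOPs per token
tokens_per_s = 16_000            # reported aggregate token rate
num_gpus = 4                     # 4x H100

tflops_per_gpu = tokens_per_s * flops_per_token / num_gpus / 1e12
print(f"{tflops_per_gpu:.0f} TFLOPs/GPU")  # -> 560, matching the reported figure
```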
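The abstract only outlines SingleInputKV and AcrossKV at a high level; the toy sketch below illustrates the idea as I read it and is not the authors' implementation. All names (`ToyLayer`, `split_layer`, `kv_group`, the separate `w_k`/`w_v` projections) are illustrative assumptions, and the single `block` linear stands in for a full attention + MLP layer.

```python
# Minimal sketch of the SingleInputKV / AcrossKV idea described in the abstract.
# NOT the authors' implementation; all layer and parameter names are assumptions.
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Stand-in for a transformer decoder layer: one linear "block" plus
    separate K/V projections whose outputs would populate the KV cache."""
    def __init__(self, d_model: int):
        super().__init__()
        self.block = nn.Linear(d_model, d_model)   # placeholder for attention + MLP
        self.w_k = nn.Linear(d_model, d_model)     # key projection
        self.w_v = nn.Linear(d_model, d_model)     # value projection

    def kv(self, h: torch.Tensor):
        # K/V entries this layer would write into the cache for input h.
        return self.w_k(h), self.w_v(h)

    def forward(self, h: torch.Tensor):
        return h + self.block(h)

def singleinputkv_prefill(layers, x, split_layer: int, kv_group: int = 1):
    """Prefill the KV cache for all layers while running prompt tokens
    through only the first `split_layer` layers (SingleInputKV).

    kv_group > 1 additionally illustrates AcrossKV: groups of neighboring
    layers past the split share one merged KV-cache entry.
    """
    h = x
    kv_cache = {}
    # 1) Prompt tokens are processed normally up to the split layer.
    for i in range(split_layer):
        kv_cache[i] = layers[i].kv(h)
        h = layers[i](h)
    # 2) The split layer's output is reused as the *single input* to the KV
    #    projections of every later layer; their attention/MLP compute is
    #    skipped for prompt tokens.
    for i in range(split_layer, len(layers), kv_group):
        kv_cache[i] = layers[i].kv(h)          # shared by layers i..i+kv_group-1
    return kv_cache

# Usage: 8 toy layers, skip the last 4 during prefill, merge KV caches in pairs
# past the split.
layers = nn.ModuleList(ToyLayer(64) for _ in range(8))
prompt = torch.randn(1, 16, 64)                 # (batch, prompt_len, d_model)
cache = singleinputkv_prefill(layers, prompt, split_layer=4, kv_group=2)
print(sorted(cache.keys()))                     # [0, 1, 2, 3, 4, 6]
```

In this toy configuration, `split_layer=4` of 8 layers means prompt tokens skip half the per-layer compute, and `kv_group=2` halves the KV entries stored for the skipped layers; the 50% prefill-compute and 62.5% KV-cache figures in the abstract come from the split point and merge ratio chosen for the full Llama models, not from this toy setup.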