Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

Abstract

Video large language models (VideoLLMs) have demonstrated the ability to process longer video inputs and to support complex reasoning and analysis. However, because video frames produce thousands of visual tokens, the key-value (KV) cache can significantly increase memory requirements and become a bottleneck for inference speed and memory usage. KV cache quantization is a widely used approach to this problem. In this paper, we find that 2-bit KV quantization of VideoLLMs barely hurts model performance, while the limits of KV cache quantization at even lower bit widths have not been investigated. To bridge this gap, we introduce VidKV, a plug-and-play KV cache quantization method that compresses the KV cache to fewer than 2 bits. Specifically, (1) for the key cache, we propose a mixed-precision quantization strategy along the channel dimension, applying 2-bit quantization to anomalous channels and 1-bit quantization combined with FFT to normal channels; (2) for the value cache, we implement 1.58-bit quantization while selectively filtering semantically salient visual tokens for targeted preservation, which gives a better trade-off between precision and model performance. Importantly, our findings suggest that the value cache of VideoLLMs should be quantized per channel rather than per token, as proposed by prior KV cache quantization work for LLMs. Empirically, extensive results with LLaVA-OV-7B and Qwen2.5-VL-7B on six benchmarks show that VidKV effectively compresses the KV cache to 1.5-bit and 1.58-bit precision with almost no performance drop compared to the FP16 counterparts.
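The quantization granularities named in the abstract can be illustrated with a small NumPy sketch. This is not the VidKV implementation: the function names, the min/max and mean-absolute-value scaling rules, the range-based choice of "anomalous" channels, and the `frac_2bit` parameter are all assumptions made for illustration, and the FFT step and the salient-token filtering are omitted. The sketch only shows how per-channel statistics differ from per-token statistics, why ternary values correspond to about 1.58 bits (log2(3) ≈ 1.585), and how mixing 2-bit and 1-bit channels yields a fractional average bit width.

```python
import numpy as np

def quantize_uniform(x, bits, axis):
    """Uniform quantize-dequantize using min/max statistics reduced along `axis`.

    For x of shape (tokens, channels): axis=0 gives one scale per channel
    (per-channel), axis=1 gives one scale per token (per-token).
    """
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # guard constant slices
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def quantize_ternary_per_channel(x):
    """Map each channel to {-s, 0, +s}; three levels cost log2(3) ~ 1.58 bits.

    Assumption: s is the mean absolute value of the channel.
    """
    scale = np.abs(x).mean(axis=0, keepdims=True)
    scale = np.where(scale > 0, scale, 1.0)
    q = np.clip(np.round(x / scale), -1, 1)
    return q * scale

def mixed_precision_keys(k, frac_2bit):
    """2-bit for the widest-range channels, 1-bit for the rest.

    Assumption: "anomalous" channels are those with the largest value range.
    Returns the dequantized keys and the average bits per element.
    """
    n_channels = k.shape[1]
    n_hi = int(round(frac_2bit * n_channels))
    ranges = k.max(axis=0) - k.min(axis=0)
    mask = np.zeros(n_channels, dtype=bool)
    mask[np.argsort(ranges)[n_channels - n_hi:]] = True
    out = np.empty_like(k)
    out[:, mask] = quantize_uniform(k[:, mask], bits=2, axis=0)
    out[:, ~mask] = quantize_uniform(k[:, ~mask], bits=1, axis=0)
    avg_bits = (2 * n_hi + (n_channels - n_hi)) / n_channels
    return out, avg_bits

rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 8))    # toy cache: 64 tokens, 8 channels
values = rng.normal(size=(64, 8))

keys_dq, key_bits = mixed_precision_keys(keys, frac_2bit=0.5)
values_dq = quantize_ternary_per_channel(values)
print(key_bits, np.log2(3))
```

In this toy setup, quantizing half of the channels at 2 bits and the other half at 1 bit gives an average of 1.5 bits per key element; whether the paper reaches its 1.5-bit figure with the same split is not stated in the abstract.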
