Adaptive Semantic Prompt Caching with VectorQ
February 6, 2025
Authors: Luis Gaspar Schroeder, Shu Liu, Alejandro Cuadron, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez
cs.AI
Abstract
Semantic prompt caches reduce the latency and cost of large language model
(LLM) inference by reusing cached LLM-generated responses for semantically
similar prompts. Vector similarity metrics assign a numerical score to quantify
the similarity between an embedded prompt and its nearest neighbor in the
cache. Existing systems rely on a static threshold to classify whether the
similarity score is sufficiently high to result in a cache hit. We show that
this one-size-fits-all threshold is insufficient across different prompts. We
propose VectorQ, a framework to learn embedding-specific threshold regions that
adapt to the complexity and uncertainty of an embedding. Through evaluations on
a combination of four diverse datasets, we show that VectorQ consistently
outperforms state-of-the-art systems across all static thresholds, achieving up
to 12x increases in cache hit rate and error rate reductions up to 92%.
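
To make the cache-hit decision described in the abstract concrete, below is a minimal Python sketch of a semantic prompt cache that compares the nearest-neighbor similarity of an embedded prompt against a threshold. It is not the authors' implementation: the `embed_fn` and `generate_fn` callables, the cosine-similarity choice, and the per-entry threshold update rule are hypothetical placeholders used only to illustrate the difference between a single static threshold and embedding-specific thresholds.

```python
# Illustrative sketch of a semantic prompt cache with per-entry thresholds.
# The embedding model, the LLM call, and the threshold update rule are
# hypothetical placeholders, not the VectorQ algorithm from the paper.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticPromptCache:
    def __init__(self, embed_fn, generate_fn, default_threshold: float = 0.9):
        self.embed = embed_fn          # prompt -> np.ndarray embedding (assumed)
        self.generate = generate_fn    # prompt -> LLM response (assumed)
        self.entries = []              # each entry: embedding, response, threshold
        self.default_threshold = default_threshold

    def query(self, prompt: str) -> str:
        emb = self.embed(prompt)
        if self.entries:
            # Find the nearest neighbor of the embedded prompt in the cache.
            scores = [cosine_similarity(emb, e["embedding"]) for e in self.entries]
            best_idx = int(np.argmax(scores))
            best = self.entries[best_idx]
            # Cache hit only if the score clears this entry's own threshold.
            # (A static cache would compare against one global constant instead.)
            if scores[best_idx] >= best["threshold"]:
                return best["response"]
        # Cache miss: call the LLM and insert a new entry.
        response = self.generate(prompt)
        self.entries.append({
            "embedding": emb,
            "response": response,
            "threshold": self.default_threshold,
        })
        return response

    def report_incorrect_reuse(self, entry_idx: int, observed_score: float):
        # Illustrative feedback rule: if reusing this entry at `observed_score`
        # produced a wrong response, raise the entry's threshold above that
        # score so the same mistake is not repeated. VectorQ's actual
        # threshold-region learning differs from this simple update.
        entry = self.entries[entry_idx]
        entry["threshold"] = max(entry["threshold"], observed_score + 1e-3)
```

A static-threshold cache corresponds to fixing `threshold` to the same constant for every entry; the paper's observation is that the appropriate cut-off varies across prompts, which per-entry thresholds such as those sketched above make explicit.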