Tensor Product Attention Is All You Need

January 11, 2025
Authors: Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
cs.AI

Abstract

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking KV cache size at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and seamlessly integrating with RoPE, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation of language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines including MHA, MQA, GQA, and MLA across various metrics, including perplexity and a range of renowned evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
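
The sketch below is a rough illustration of the contextual low-rank factorization idea described in the abstract, not the authors' implementation (see the linked repository for that). All names and sizes (`TPAKeyValueSketch`, `d_model`, `n_heads`, `head_dim`, `r_k`, `r_v`) are hypothetical, and RoPE integration and the query factorization are omitted; it only shows why caching the per-token factors instead of full keys and values shrinks the KV cache.

```python
# Minimal sketch of contextual low-rank K/V factorization (assumed sizes, not the paper's code).
import torch
import torch.nn as nn

class TPAKeyValueSketch(nn.Module):
    """Per-token key/value factors whose outer products reconstruct
    the full (n_heads x head_dim) keys and values."""
    def __init__(self, d_model=512, n_heads=8, head_dim=64, r_k=2, r_v=2):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.r_k, self.r_v = r_k, r_v
        # Contextual factors: head-side (a) and feature-side (b) components,
        # both computed from the current token's hidden state.
        self.k_a = nn.Linear(d_model, r_k * n_heads)
        self.k_b = nn.Linear(d_model, r_k * head_dim)
        self.v_a = nn.Linear(d_model, r_v * n_heads)
        self.v_b = nn.Linear(d_model, r_v * head_dim)

    def forward(self, x):  # x: (batch, seq, d_model)
        B, T, _ = x.shape
        # At inference time only these small factors would be cached:
        # r * (n_heads + head_dim) numbers per token instead of n_heads * head_dim.
        ka = self.k_a(x).view(B, T, self.r_k, self.n_heads)
        kb = self.k_b(x).view(B, T, self.r_k, self.head_dim)
        va = self.v_a(x).view(B, T, self.r_v, self.n_heads)
        vb = self.v_b(x).view(B, T, self.r_v, self.head_dim)
        # Reconstruct full keys/values as sums of rank-1 outer products.
        k = torch.einsum("btrh,btrd->bthd", ka, kb) / self.r_k
        v = torch.einsum("btrh,btrd->bthd", va, vb) / self.r_v
        return k, v  # each: (batch, seq, n_heads, head_dim)
```

With the assumed sizes, a standard KV cache stores 2 * 8 * 64 = 1024 values per token, while the factors above require only 2 * 2 * (8 + 64) = 288, which is the source of the memory savings the abstract refers to.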
