Tensor Product Attention Is All You Need

Abstract

Scaling language models to handle longer input sequences typically requires large key-value (KV) caches, which impose substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking the KV cache at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and integrating seamlessly with RoPE, TPA improves model quality while also improving memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 exceeds the performance of standard Transformer baselines, including MHA, MQA, GQA, and MLA, across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
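
To illustrate why a contextual low-rank factorization can shrink the KV cache, the following is a minimal sketch and not the T6 implementation. The abstract does not spell out the factorization, so the sketch assumes one plausible form: each token's key is an average of `rank` outer products between a head-dimension factor `a` and a feature-dimension factor `b`, both computed from the token's hidden state. All names (`factored_key`, `cache_floats_per_token`, `w_a`, `w_b`) and all sizes are hypothetical, and RoPE is omitted.

```python
import random


def random_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight):
    """Multiply vector x (length d_model) by weight (d_model x d_out)."""
    d_out = len(weight[0])
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(d_out)]


def factored_key(x, w_a, w_b, num_heads, head_dim, rank):
    """Assumed form: K = (1/rank) * sum_r outer(a_r(x), b_r(x)).

    Returns the factors (a, b) and the rebuilt num_heads x head_dim key.
    Only a and b would need to be cached under this assumption.
    """
    a_flat = linear(x, w_a)  # rank * num_heads values, depend on the token
    b_flat = linear(x, w_b)  # rank * head_dim values, depend on the token
    a = [a_flat[r * num_heads:(r + 1) * num_heads] for r in range(rank)]
    b = [b_flat[r * head_dim:(r + 1) * head_dim] for r in range(rank)]
    k = [
        [sum(a[r][h] * b[r][d] for r in range(rank)) / rank for d in range(head_dim)]
        for h in range(num_heads)
    ]
    return a, b, k


def cache_floats_per_token(num_heads, head_dim, rank=None):
    """Floats cached per token for keys plus values."""
    if rank is None:
        return 2 * num_heads * head_dim       # full K and V (MHA-style cache)
    return 2 * rank * (num_heads + head_dim)  # factors only


rng = random.Random(0)
d_model, num_heads, head_dim, rank = 16, 8, 64, 2  # illustrative sizes only
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
w_a = random_matrix(d_model, rank * num_heads, rng)
w_b = random_matrix(d_model, rank * head_dim, rng)

a, b, k = factored_key(x, w_a, w_b, num_heads, head_dim, rank)
print(len(k), len(k[0]))                                # 8 64
print(cache_floats_per_token(num_heads, head_dim))       # 1024
print(cache_floats_per_token(num_heads, head_dim, rank)) # 288
```

Under these assumed sizes, caching the factors stores 288 floats per token instead of 1024, because the cost grows with `rank * (num_heads + head_dim)` rather than `num_heads * head_dim`. The actual ranks, projections, and savings in T6 may differ from this sketch.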
