Tensor Product Attention Is All You Need

Abstract

Scaling language models to handle longer input sequences typically requires large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, significantly shrinking the KV cache at inference time. By factorizing these representations into contextual low-rank components (contextual factorization) and integrating seamlessly with RoPE, TPA improves model quality while remaining memory efficient. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 outperforms standard Transformer baselines, including MHA, MQA, GQA, and MLA, across various metrics, including perplexity and a range of well-known evaluation benchmarks. Notably, TPA's memory efficiency enables the processing of significantly longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
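To make the memory argument concrete, the following is a minimal sketch, assuming each token's key (or value) matrix of shape heads × head dimension is stored as a small number of rank-1 factor pairs rather than in full. The abstract does not spell out the exact factorization, so the function names, the 1/R averaging, and the sizes used here (32 heads, head dimension 128, rank 2) are illustrative assumptions and not the T6 configuration.

```python
# Illustrative sketch only: names, ranks, and sizes are assumptions for
# demonstration and are not taken from the T6 code base.

def outer(a, b):
    """Outer product of two vectors as a nested list (len(a) x len(b))."""
    return [[ai * bj for bj in b] for ai in a]

def reconstruct(head_factors, dim_factors):
    """Average of R rank-1 terms: (1/R) * sum_r outer(head_factors[r], dim_factors[r])."""
    rank = len(head_factors)
    n_heads, head_dim = len(head_factors[0]), len(dim_factors[0])
    out = [[0.0] * head_dim for _ in range(n_heads)]
    for a, b in zip(head_factors, dim_factors):
        term = outer(a, b)
        for i in range(n_heads):
            for j in range(head_dim):
                out[i][j] += term[i][j] / rank
    return out

def cache_floats_per_token(n_heads, head_dim, rank_k=None, rank_v=None):
    """Numbers cached per token: full K and V, or only their low-rank factors."""
    if rank_k is None or rank_v is None:
        # Full keys and values: one (n_heads x head_dim) matrix each.
        return 2 * n_heads * head_dim
    # Factorized: each rank-1 term stores one head-side and one dim-side vector.
    return (rank_k + rank_v) * (n_heads + head_dim)

# Toy key for one token: 2 heads, head dimension 3, rank 2.
key = reconstruct(
    head_factors=[[1.0, 0.0], [0.0, 2.0]],
    dim_factors=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
)

# Hypothetical model sizes, chosen only to show the scale of the saving.
full = cache_floats_per_token(n_heads=32, head_dim=128)
factored = cache_floats_per_token(n_heads=32, head_dim=128, rank_k=2, rank_v=2)
print(key, full, factored)
```

Under these assumed sizes, the factorized cache holds 640 numbers per token instead of 8192. Caching factors in place of full keys and values is the kind of saving that would allow longer sequences within a fixed memory budget.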
