What Matters in Transformers? Not All Attention is Needed

Abstract

Scaling Transformer-based large language models (LLMs) has delivered promising performance across a wide range of tasks, but it also introduces redundant architecture that makes real-world deployment less efficient. Although redundancy in LLMs is recognized, how it varies across the different modules inside a Transformer, such as MLP and Attention layers, remains under-explored. In this work, we use a similarity-based metric to investigate redundancy across Blocks, MLP layers, and Attention layers. Surprisingly, even though attention is what distinguishes Transformers from other architectures, we find that a large portion of Attention layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, pruning half of the Attention layers in Llama-2-70B yields a 48.4% speedup with only a 2.4% performance drop. By tracing model checkpoints throughout training, we also observe that Attention layer redundancy is inherent and consistent across training stages. Finally, we propose a method that jointly drops Attention and MLP layers, which allows more layers to be removed. For instance, after dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of its performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
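The abstract does not define the similarity-based metric, so the following is only a minimal sketch of how such a metric could rank layers for removal. It assumes cosine similarity between a module's input and output hidden states, with an importance score of one minus the mean similarity, so that a layer that barely changes its input scores near zero. All function names are hypothetical and are not taken from the LLM-Drop repository.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def importance_score(inputs, outputs):
    """Assumed metric: 1 - mean cosine similarity over token vectors.

    `inputs` and `outputs` are lists of per-token hidden-state vectors
    taken before and after one module (Block, MLP, or Attention layer).
    """
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def select_layers_to_drop(scores, num_to_drop):
    """Return the indices of the `num_to_drop` lowest-scoring layers."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_to_drop])


if __name__ == "__main__":
    # Toy hidden states for two tokens; values are made up for illustration.
    layer_input = [[1.0, 0.0], [0.0, 1.0]]
    nearly_identity_output = [[2.0, 0.0], [0.0, 3.0]]  # same direction
    transforming_output = [[0.0, 1.0], [1.0, 0.0]]     # orthogonal direction

    scores = [
        importance_score(layer_input, nearly_identity_output),
        importance_score(layer_input, transforming_output),
    ]
    print(scores)
    print(select_layers_to_drop(scores, 1))
```

In a real model, the hidden states would come from a calibration dataset, and the scores would be computed separately for each module type before deciding how many layers to drop.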
