Hymba: A Hybrid-head Architecture for Small Language Models

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. The model is further optimized with cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B, with 1.32% higher average accuracy, an 11.67x reduction in cache size, and 3.49x the throughput.
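
To make the cache argument concrete, the following minimal sketch counts how many token positions a KV cache has to hold when some layers use sliding window attention and groups of those layers share one cache. It is a simplified accounting under stated assumptions, not Hymba's implementation: the function name, its parameters, and the example configuration are all hypothetical, and the resulting ratio is illustrative rather than a reproduction of the reported 11.67x figure.

```python
def cached_positions(seq_len, num_layers, num_global_layers, window, share_group):
    """Count cached token positions summed over layers (per KV head, per sequence).

    Assumptions (hypothetical, for illustration only):
      - global (full attention) layers each keep their own cache of seq_len positions;
      - the remaining layers use sliding window attention and cache at most
        `window` positions;
      - every `share_group` sliding window layers share a single KV cache.
    """
    local_layers = num_layers - num_global_layers
    # Ceiling division: one stored cache per group of sharing layers.
    local_caches = -(-local_layers // share_group)
    return num_global_layers * seq_len + local_caches * min(seq_len, window)


# Baseline: every layer uses full attention with its own cache.
baseline = cached_positions(seq_len=8192, num_layers=32, num_global_layers=32,
                            window=1024, share_group=1)

# Illustrative compact setup: 3 full attention layers, the rest use a
# 1024-token window and share caches in pairs.
compact = cached_positions(seq_len=8192, num_layers=32, num_global_layers=3,
                           window=1024, share_group=2)

print(baseline)                      # 262144
print(compact)                       # 39936
print(round(baseline / compact, 2))  # 6.56
```

The two mechanisms help in different ways under this accounting: the window caps the per-layer cost at a constant once the sequence exceeds it, while sharing reduces the number of caches that must be stored at all.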
