
Hymba: A Hybrid-head Architecture for Small Language Models

November 20, 2024
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
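To make the hybrid-head idea concrete, below is a minimal, illustrative PyTorch sketch of a parallel attention-plus-SSM block with learnable meta tokens. This is not the authors' implementation: the `ToySSMHead` is a simplified diagonal linear recurrence standing in for a Mamba-style SSM head, the fusion is a plain normalized average, no causal masking, KV sharing, or sliding-window attention is shown, and all module and parameter names (`HybridHeadBlock`, `num_meta_tokens`, etc.) are assumptions made for illustration only.

```python
# Minimal, illustrative sketch of a hybrid-head parallel block.
# NOT the authors' implementation; names and details are assumptions.
import torch
import torch.nn as nn


class ToySSMHead(nn.Module):
    """Stand-in for a Mamba-style SSM head: a per-channel diagonal linear recurrence."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Per-channel decay in (0, 1); a real SSM uses learned, input-dependent dynamics.
        self.log_decay = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u = self.in_proj(x)
        decay = torch.sigmoid(self.log_decay)             # (D,)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):                        # sequential scan, for clarity only
            state = decay * state + (1.0 - decay) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


class HybridHeadBlock(nn.Module):
    """Attention heads and SSM heads process the same input in parallel;
    their normalized outputs are fused, roughly as the abstract describes."""

    def __init__(self, dim: int, num_heads: int, num_meta_tokens: int):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, num_meta_tokens, dim) * 0.02)
        # No causal mask here for brevity; a real language model would use one.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ssm = ToySSMHead(dim)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        # Prepend learnable meta tokens so attention has somewhere to "park" focus,
        # easing the "forced-to-attend" burden mentioned in the abstract.
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        h = torch.cat([meta, x], dim=1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self.ssm(h)
        fused = 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))
        return fused[:, meta.size(1):]                    # drop the meta-token positions


if __name__ == "__main__":
    block = HybridHeadBlock(dim=64, num_heads=4, num_meta_tokens=8)
    y = block(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```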
