
Hymba: A Hybrid-head Architecture for Small Language Models

November 20, 2024
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.
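To make the hybrid-head idea concrete, below is a minimal, illustrative PyTorch sketch of a parallel attention-plus-SSM block with learnable meta tokens. This is not the authors' implementation: the `ToySSMHead` is a simplified diagonal linear recurrence standing in for a Mamba-style SSM head, the fusion is a plain normalized average, no causal masking, KV sharing, or sliding-window attention is shown, and all module and parameter names (`HybridHeadBlock`, `num_meta_tokens`, etc.) are assumptions made for illustration only.

```python
# Minimal, illustrative sketch of a hybrid-head parallel block.
# NOT the authors' implementation; names and details are assumptions.
import torch
import torch.nn as nn


class ToySSMHead(nn.Module):
    """Stand-in for a Mamba-style SSM head: a per-channel diagonal linear recurrence."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Per-channel decay in (0, 1); a real SSM uses learned, input-dependent dynamics.
        self.log_decay = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        u = self.in_proj(x)
        decay = torch.sigmoid(self.log_decay)             # (D,)
        state = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):                        # sequential scan, for clarity only
            state = decay * state + (1.0 - decay) * u[:, t]
            outputs.append(state)
        return self.out_proj(torch.stack(outputs, dim=1))


class HybridHeadBlock(nn.Module):
    """Attention heads and SSM heads process the same input in parallel;
    their normalized outputs are fused, roughly as the abstract describes."""

    def __init__(self, dim: int, num_heads: int, num_meta_tokens: int):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, num_meta_tokens, dim) * 0.02)
        # No causal mask here for brevity; a real language model would use one.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ssm = ToySSMHead(dim)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ssm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        # Prepend learnable meta tokens so attention has somewhere to "park" focus,
        # easing the "forced-to-attend" burden mentioned in the abstract.
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        h = torch.cat([meta, x], dim=1)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self.ssm(h)
        fused = 0.5 * (self.norm_attn(attn_out) + self.norm_ssm(ssm_out))
        return fused[:, meta.size(1):]                    # drop the meta-token positions


if __name__ == "__main__":
    block = HybridHeadBlock(dim=64, num_heads=4, num_meta_tokens=8)
    y = block(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```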
