
Hymba: A Hybrid-head Architecture for Small Language Models

November 20, 2024
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.

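Below is a minimal, illustrative PyTorch sketch of the parallel hybrid-head idea described in the abstract: learnable meta tokens are prepended to the prompt, attention heads and an SSM head process the same token sequence in parallel, and their outputs are fused. The dimensions, the toy diagonal SSM recurrence, the averaging fusion, and all module names are assumptions for illustration only; the sketch omits the paper's cross-layer KV sharing and partial sliding-window attention optimizations.

```python
# Illustrative sketch only: a hybrid-head block with attention and a toy SSM head
# running in parallel on the same (meta tokens + prompt) sequence. Not the paper's
# exact formulation; all shapes, the diagonal recurrence, and the fusion are assumed.
import torch
import torch.nn as nn


class SimpleSSMHead(nn.Module):
    """Toy diagonal state-space head: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay (pre-sigmoid)
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        a = torch.sigmoid(self.log_a)                  # keep the recurrence stable in (0, 1)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):                     # sequential scan over tokens
            h = a * h + self.b * x[:, t]
            outputs.append(self.c * h)
        return torch.stack(outputs, dim=1)             # (batch, seq, dim)


class HybridHeadBlock(nn.Module):
    """Attention and SSM heads process the same meta-token-augmented input in parallel."""

    def __init__(self, dim: int, num_heads: int, num_meta_tokens: int):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, num_meta_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ssm = SimpleSSMHead(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([meta, x], dim=1)                # prepend learnable meta tokens
        attn_out, _ = self.attn(x, x, x)               # high-resolution recall
        ssm_out = self.ssm(x)                          # efficient context summarization
        fused = 0.5 * (attn_out + ssm_out)             # simple average fusion (assumed)
        return self.out_proj(fused)


if __name__ == "__main__":
    block = HybridHeadBlock(dim=64, num_heads=4, num_meta_tokens=8)
    tokens = torch.randn(2, 16, 64)                    # (batch, prompt length, dim)
    print(block(tokens).shape)                         # torch.Size([2, 24, 64])
```

The key point conveyed by the abstract, reflected in this sketch, is that the attention and SSM heads sit side by side within a block rather than being stacked in alternating layers, so both views of the sequence are available to every layer.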