Hymba: A Hybrid-head Architecture for Small Language Models
November 20, 2024
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI
Abstract
We propose Hymba, a family of small language models featuring a hybrid-head
parallel architecture that integrates transformer attention mechanisms with
state space models (SSMs) for enhanced efficiency. Attention heads provide
high-resolution recall, while SSM heads enable efficient context summarization.
Additionally, we introduce learnable meta tokens that are prepended to prompts,
storing critical information and alleviating the "forced-to-attend" burden
associated with attention mechanisms. This model is further optimized by
incorporating cross-layer key-value (KV) sharing and partial sliding window
attention, resulting in a compact cache size. During development, we conducted
a controlled study comparing various architectures under identical settings and
observed significant advantages of our proposed architecture. Notably, Hymba
achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model
surpasses all sub-2B public models in performance and even outperforms
Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size
reduction, and 3.49x throughput.
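To make the hybrid-head idea concrete, below is a minimal PyTorch sketch of a single block in the spirit of the abstract: learnable meta tokens are prepended to the input, attention heads and a toy SSM-style recurrence process the same normalized sequence in parallel, and their outputs are fused. The module names, dimensions, diagonal recurrence, and averaging fusion are illustrative assumptions; causal masking, cross-layer KV sharing, and sliding window attention from the paper are omitted. This is a sketch, not the authors' implementation.

```python
# Minimal sketch of a hybrid-head block (illustrative assumptions throughout).
import torch
import torch.nn as nn


class SimpleSSMHead(nn.Module):
    """Toy diagonal state-space recurrence standing in for a real SSM head."""

    def __init__(self, dim: int, state_dim: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(dim, state_dim)
        self.out_proj = nn.Linear(state_dim, dim)
        # Learnable per-channel decay, kept in (0, 1) via sigmoid.
        self.decay_logit = nn.Parameter(torch.zeros(state_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        u = self.in_proj(x)                         # (B, T, S)
        a = torch.sigmoid(self.decay_logit)         # (S,)
        h = torch.zeros(u.size(0), u.size(2), device=u.device, dtype=u.dtype)
        outputs = []
        for t in range(u.size(1)):                  # sequential scan (unoptimized)
            h = a * h + (1.0 - a) * u[:, t]
            outputs.append(h)
        y = torch.stack(outputs, dim=1)             # (B, T, S)
        return self.out_proj(y)                     # (B, T, dim)


class HybridHeadBlock(nn.Module):
    """Parallel attention and SSM heads over a shared input with meta tokens."""

    def __init__(self, dim: int = 256, n_heads: int = 4, n_meta_tokens: int = 8):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm = SimpleSSMHead(dim)
        self.norm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Prepend learnable meta tokens so attention has a default place to attend.
        b = x.size(0)
        meta = self.meta_tokens.expand(b, -1, -1)
        x = torch.cat([meta, x], dim=1)
        h = self.norm(x)
        # Both head types see the same normalized input, in parallel.
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        ssm_out = self.ssm(h)
        # Average the two branches (an illustrative fusion choice).
        fused = self.out_proj(0.5 * (attn_out + ssm_out))
        # Residual connection, then drop the meta-token positions.
        return (x + fused)[:, meta.size(1):]


if __name__ == "__main__":
    block = HybridHeadBlock()
    tokens = torch.randn(2, 32, 256)    # (batch, seq_len, hidden_dim)
    print(block(tokens).shape)          # torch.Size([2, 32, 256])
```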