
Hymba: A Hybrid-head Architecture for Small Language Models

November 20, 2024
Authors: Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
cs.AI

Abstract

We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency. Attention heads provide high-resolution recall, while SSM heads enable efficient context summarization. Additionally, we introduce learnable meta tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms. This model is further optimized by incorporating cross-layer key-value (KV) sharing and partial sliding window attention, resulting in a compact cache size. During development, we conducted a controlled study comparing various architectures under identical settings and observed significant advantages of our proposed architecture. Notably, Hymba achieves state-of-the-art results for small LMs: Our Hymba-1.5B-Base model surpasses all sub-2B public models in performance and even outperforms Llama-3.2-3B with 1.32% higher average accuracy, an 11.67x cache size reduction, and 3.49x throughput.

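Below is a minimal, illustrative PyTorch sketch of the parallel hybrid-head idea described in the abstract: learnable meta tokens are prepended to the prompt, attention heads and an SSM head process the same token sequence in parallel, and their outputs are fused. The dimensions, the toy diagonal SSM recurrence, the averaging fusion, and all module names are assumptions for illustration only; the sketch omits the paper's cross-layer KV sharing and partial sliding-window attention optimizations.

```python
# Illustrative sketch only: a hybrid-head block with attention and a toy SSM head
# running in parallel on the same (meta tokens + prompt) sequence. Not the paper's
# exact formulation; all shapes, the diagonal recurrence, and the fusion are assumed.
import torch
import torch.nn as nn


class SimpleSSMHead(nn.Module):
    """Toy diagonal state-space head: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # per-channel decay (pre-sigmoid)
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        a = torch.sigmoid(self.log_a)                  # keep the recurrence stable in (0, 1)
        h = torch.zeros(x.size(0), x.size(2), device=x.device)
        outputs = []
        for t in range(x.size(1)):                     # sequential scan over tokens
            h = a * h + self.b * x[:, t]
            outputs.append(self.c * h)
        return torch.stack(outputs, dim=1)             # (batch, seq, dim)


class HybridHeadBlock(nn.Module):
    """Attention and SSM heads process the same meta-token-augmented input in parallel."""

    def __init__(self, dim: int, num_heads: int, num_meta_tokens: int):
        super().__init__()
        self.meta_tokens = nn.Parameter(torch.randn(1, num_meta_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ssm = SimpleSSMHead(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        meta = self.meta_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([meta, x], dim=1)                # prepend learnable meta tokens
        attn_out, _ = self.attn(x, x, x)               # high-resolution recall
        ssm_out = self.ssm(x)                          # efficient context summarization
        fused = 0.5 * (attn_out + ssm_out)             # simple average fusion (assumed)
        return self.out_proj(fused)


if __name__ == "__main__":
    block = HybridHeadBlock(dim=64, num_heads=4, num_meta_tokens=8)
    tokens = torch.randn(2, 16, 64)                    # (batch, prompt length, dim)
    print(block(tokens).shape)                         # torch.Size([2, 24, 64])
```

The key point conveyed by the abstract, reflected in this sketch, is that the attention and SSM heads sit side by side within a block rather than being stacked in alternating layers, so both views of the sequence are available to every layer.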