

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

November 28, 2024
Authors: Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities, but their adoption is limited by high computational costs during inference. While increasing parameter counts enhances accuracy, it also widens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a framework that accelerates LLM inference on specific hardware while preserving model capabilities. Through an innovative application of neural architecture search (NAS) at an unprecedented scale, Puzzle systematically optimizes models with tens of billions of parameters under hardware constraints. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We demonstrate the real-world impact of our framework through Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B), a publicly available model derived from Llama-3.1-70B-Instruct. Nemotron-51B achieves a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while preserving 98.4% of the original model's capabilities. Nemotron-51B currently stands as the most accurate language model capable of inference on a single GPU with large batch sizes. Remarkably, this transformation required just 45B training tokens, compared to the over 15T tokens used to train the 70B model it was derived from. This establishes a new paradigm in which powerful models can be optimized for efficient deployment with only negligible compromise of their capabilities, demonstrating that inference performance, not parameter count alone, should guide model selection. With the release of Nemotron-51B and the presentation of the Puzzle framework, we provide practitioners immediate access to state-of-the-art language modeling capabilities at significantly reduced computational costs.
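The blockwise local knowledge distillation step described in the abstract can be pictured as training each candidate replacement block to reproduce its parent block's input-to-output mapping in isolation, which is what makes the architecture exploration parallelizable. Below is a minimal PyTorch sketch of that idea; the class names, dimensions, and training loop are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for one parent transformer sub-block and a
# cheaper candidate replacement (here, a 4x narrower FFN). Real Puzzle
# candidates also vary attention and may skip blocks entirely.
class ParentBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return x + self.ffn(x)  # residual connection

class CandidateBlock(nn.Module):
    def __init__(self, d_model=512, d_ff=512):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return x + self.ffn(x)

def distill_block(parent, candidate, hidden_states, steps=100, lr=1e-4):
    """Train one candidate to mimic its parent block's local behavior.

    Because each candidate is fit only against its own parent block's
    outputs, every (layer position, candidate) pair can be trained
    independently and in parallel.
    """
    parent.eval()
    opt = torch.optim.AdamW(candidate.parameters(), lr=lr)
    for _ in range(steps):
        # Sample a mini-batch of cached activations entering this block
        x = hidden_states[torch.randint(len(hidden_states), (8,))]
        with torch.no_grad():
            target = parent(x)  # the parent's local output
        loss = nn.functional.mse_loss(candidate(x), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()  # proxy score for this candidate's local quality

# hidden_states would be activations cached from running the parent model
hidden_states = torch.randn(1024, 512)
final_loss = distill_block(ParentBlock(), CandidateBlock(), hidden_states)
```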
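The subsequent search step can likewise be sketched as a small mixed-integer program: choose exactly one variant per layer so that the summed per-block quality losses are minimized while the assembled model fits a hardware budget. The toy model below uses the open-source PuLP modeling library; the variant menu, loss scores, memory figures, and budget are invented for illustration and are not the paper's numbers.

```python
import pulp

num_layers = 4
# (quality_loss, memory_gb) per variant: full block, slim block, skip.
# Scores like these would come from the blockwise distillation stage.
variants = [
    [(0.00, 2.0), (0.05, 1.2), (0.30, 0.1)]  # same menu for every layer
    for _ in range(num_layers)
]
memory_budget_gb = 5.0

prob = pulp.LpProblem("block_selection", pulp.LpMinimize)
x = {(l, v): pulp.LpVariable(f"x_{l}_{v}", cat="Binary")
     for l in range(num_layers) for v in range(3)}

# Objective: minimize the total estimated quality loss
prob += pulp.lpSum(variants[l][v][0] * x[l, v]
                   for l in range(num_layers) for v in range(3))

# Exactly one variant must be chosen for each layer
for l in range(num_layers):
    prob += pulp.lpSum(x[l, v] for v in range(3)) == 1

# The chosen variants must fit the hardware memory budget
prob += pulp.lpSum(variants[l][v][1] * x[l, v]
                   for l in range(num_layers)
                   for v in range(3)) <= memory_budget_gb

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [v for l in range(num_layers) for v in range(3)
          if x[l, v].value() == 1]
print("chosen variant per layer:", chosen)  # here: the slim block everywhere
```

In this toy instance the solver picks the slim block for all four layers (4.8 GB, total loss 0.20), since keeping even one full block would break the 5 GB budget. A real search would add latency and throughput constraints measured on the target GPU.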
