
A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

September 17, 2024
Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
cs.AI

Abstract

Prior research has evaluated quantized LLMs using limited metrics such as perplexity or a few basic knowledge tasks, and on old datasets. Additionally, recent large-scale models such as Llama 3.1, with up to 405B parameters, have not been thoroughly examined. This paper evaluates the performance of instruction-tuned LLMs across various quantization methods (GPTQ, AWQ, SmoothQuant, and FP8) on models ranging from 7B to 405B. Using 13 benchmarks, we assess performance across six task types: commonsense Q&A, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. Our key findings reveal that (1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, except for hallucination detection and instruction following; (2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models; (3) task difficulty does not significantly impact accuracy degradation due to quantization; and (4) the MT-Bench evaluation method has limited discriminatory power among recent high-performing LLMs.
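As a rough illustration of the weight-only quantization setup the abstract describes, the sketch below produces a 4-bit GPTQ version of an instruction-tuned model with Hugging Face transformers. The checkpoint name, bit-width, and calibration dataset are illustrative assumptions rather than the paper's exact configuration, and the other methods studied (AWQ, SmoothQuant, FP8) would use their own tooling.

```python
# Minimal sketch: weight-only 4-bit GPTQ quantization of an instruction-tuned
# model via Hugging Face transformers (requires the optimum and auto-gptq
# packages). The model ID and calibration dataset are assumptions made for
# illustration, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # hypothetical example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates on a small text sample; "c4" is a common default choice.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Passing a quantization_config triggers calibration and quantization at load time.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Save the quantized checkpoint for later benchmark evaluation
# (e.g., on the commonsense Q&A, instruction-following, and math tasks listed above).
model.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")
tokenizer.save_pretrained("llama-3.1-8b-instruct-gptq-4bit")
```

The resulting checkpoint could then be scored with a standard evaluation harness on the 13 benchmarks the paper lists, which is how per-method and per-bit-width comparisons like those in the findings are typically obtained.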
