"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Abstract

Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements that allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
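To make the integer formats named above more concrete, the following is a minimal sketch of symmetric round-to-nearest weight quantization at 8 and 4 bits. It is an illustration under stated assumptions, not the tuned procedure evaluated in the study: the function names, the single per-row scale, and the synthetic Gaussian weight row are all hypothetical choices made for this example.

```python
import random


def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of one weight row.

    Illustrative only: one scale per row, no calibration or tuning.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the signed range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to floating-point values."""
    return [v * scale for v in q]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical weight row; real model weights are not Gaussian samples.
row = [random.gauss(0.0, 0.02) for _ in range(256)]

for bits in (8, 4):
    q, scale = quantize_symmetric(row, bits)
    err = mean_abs_error(row, dequantize(q, scale))
    print(f"INT{bits}: scale={scale:.6f}, mean abs error={err:.6f}")
```

In this toy setting the 4-bit reconstruction error is larger than the 8-bit one because the same value range is covered by far fewer integer levels. The sketch covers weights only; in the W8A8 formats the activations are quantized as well, which this example does not model.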
