Addition is All You Need for Energy-efficient Language Models

Abstract

Large neural networks spend most of their computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated with high precision by a single integer adder. We propose the linear-complexity multiplication (L-Mul) algorithm, which approximates floating point multiplication with integer addition operations. Compared to 8-bit floating point multiplication, the proposed method achieves higher precision while consuming significantly less bit-level computation. Since multiplying floating point numbers requires substantially more energy than integer addition, applying the L-Mul operation in tensor processing hardware can potentially reduce the energy cost of element-wise floating point tensor multiplications by 95% and that of dot products by 80%. We calculated the theoretical error expectation of L-Mul and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with a 4-bit mantissa achieves precision comparable to float8_e4m3 multiplication, and that L-Mul with a 3-bit mantissa outperforms float8_e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves precision equivalent to using float8_e4m3 as the accumulation precision, in both fine-tuning and inference.
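
As a rough illustration of how a single integer addition can stand in for a floating point multiplication, the sketch below adds the IEEE 754 binary32 bit patterns of two numbers and subtracts the bit pattern of 1.0. This is the classic logarithmic approximation, not the L-Mul algorithm itself, whose details this abstract does not give. The function names are hypothetical, and the sketch assumes normal, finite inputs whose product neither overflows nor underflows.

```python
import struct

FP32_ONE = 0x3F800000   # bit pattern of 1.0f (exponent bias 127 shifted past the 23 mantissa bits)
SIGN_MASK = 0x80000000
MAG_MASK = 0x7FFFFFFF


def float_to_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit unsigned integer as a binary32 float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def approx_mul(x: float, y: float) -> float:
    """Approximate x * y with one integer addition on the bit patterns.

    Illustrative only: assumes normal, finite inputs and no exponent
    overflow or underflow. Not the paper's L-Mul algorithm.
    """
    bx, by = float_to_bits(x), float_to_bits(y)
    sign = (bx ^ by) & SIGN_MASK  # sign of the product
    # Adding the exponent|mantissa fields sums the exponents and the mantissa
    # fractions together; subtracting FP32_ONE removes the doubled bias.
    mag = (bx & MAG_MASK) + (by & MAG_MASK) - FP32_ONE
    return bits_to_float(sign | (mag & MAG_MASK))


if __name__ == "__main__":
    print(approx_mul(2.0, 3.0))   # 6.0  (exact: one mantissa fraction is zero)
    print(approx_mul(1.5, 1.5))   # 2.0  (true product is 2.25)
    print(approx_mul(-2.0, 4.0))  # -8.0
```

Under the stated assumptions, this approximation drops the product of the two mantissa fractions, so it never overestimates the magnitude, and its relative error is at most 1/9 (about 11%), reached at 1.5 × 1.5.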
