AERO: Softmax-Only LLMs for Efficient Private Inference

Abstract

The pervasiveness of proprietary language models has raised privacy concerns about users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis of the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and by reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs, tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to a 4.23× reduction in communication and a 1.94× reduction in latency. We validate the effectiveness of AERO by benchmarking it against the state of the art.
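To make the idea of a Softmax-only attention path more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It computes causal attention weights in which softmax is the only nonlinearity (no LayerNorm, no GELU) and measures the Shannon entropy of each attention row. The function names, the toy inputs, and in particular the form of `entropy_penalty` (a squared hinge on entropy above a fixed threshold) are assumptions made for illustration; the abstract does not specify how the entropy regularization is formulated.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask.

    Row i attends only to positions 0..i. Softmax is the only
    nonlinearity applied; there is no LayerNorm or GELU anywhere.
    """
    d = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        rows.append(softmax(scores))
    return rows


def entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_penalty(rows, threshold):
    """Hypothetical regularizer (an assumption, not taken from the paper):
    mean squared amount by which each row's entropy exceeds `threshold`."""
    excess = [max(0.0, entropy(r) - threshold) for r in rows]
    return sum(e * e for e in excess) / len(excess)


if __name__ == "__main__":
    # Toy queries and keys, chosen only for illustration.
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights = causal_attention_weights(q, k)
    for i, row in enumerate(weights):
        print(i, [round(p, 3) for p in row], round(entropy(row), 3))
    print("penalty:", entropy_penalty(weights, threshold=0.5))
```

In a real training setup a term like this would be added to the language-modeling loss with a weighting coefficient and computed per attention head; those details are left out here because the abstract does not state them.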
