

AERO: Softmax-Only LLMs for Efficient Private Inference

October 16, 2024
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI

Abstract

The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23× communication and 1.94× latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.
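To make the core idea concrete, the sketch below (a hypothetical illustration, not the authors' code) shows a decoder block where Softmax is the only remaining nonlinearity: LayerNorm is dropped, and with GELU removed the feed-forward sublayer collapses into a single linear map, which is what reduces FLOPs. It also computes the Shannon entropy of each attention row, the quantity an entropy regularizer of the kind the abstract describes would penalize; all weight names and shapes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_only_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One decoder block with Softmax as the sole nonlinearity.

    x: (seq_len, d) activations; all W*: (d, d) weight matrices
    (hypothetical names, single-head for brevity).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(x.shape[-1])
    x = x + softmax(scores) @ v @ Wo        # residual add, no LayerNorm
    # GELU removed: the two FFN matrices compose into one linear map.
    return x + x @ (W1 @ W2)

def attention_entropy(scores):
    # Shannon entropy of each attention row; a regularizer can
    # penalize rows whose entropy drifts toward degenerate values.
    p = softmax(scores)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)
```

Because the linear-only FFN is a product of matrices, it costs one matrix multiply per token instead of two plus an activation, illustrating how nonlinearity removal and FLOP reduction go hand in hand.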

