Entropy-Guided Attention for Private LLMs
January 7, 2025
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
The pervasiveness of proprietary language models has raised critical privacy
concerns, necessitating advancements in private inference (PI), where
computations are performed directly on encrypted data without revealing users'
sensitive information. While PI offers a promising solution, its practical
deployment is hindered by substantial communication and latency overheads,
primarily stemming from nonlinear operations. To address this, we introduce an
information-theoretic framework to characterize the role of nonlinearities in
decoder-only language models, laying a principled foundation for optimizing
transformer architectures tailored to the demands of PI.
By leveraging Shannon's entropy as a quantitative measure, we uncover the
previously unexplored dual significance of nonlinearities: beyond ensuring
training stability, they are crucial for maintaining attention head diversity.
Specifically, we find that their removal triggers two critical failure modes:
"entropy collapse" in deeper layers that destabilizes training, and "entropic
overload" in earlier layers that leads to under-utilization of
Multi-Head Attention's (MHA) representational capacity.
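As a concrete illustration of this diagnostic, the following minimal sketch (in PyTorch; not taken from the paper's codebase, and the function name is an assumption) computes the Shannon entropy of each attention head's post-softmax distribution, the quantity used to distinguish the two failure modes.

```python
# Minimal sketch (assumed, not the authors' code): per-head Shannon entropy
# of post-softmax attention weights.
import torch

def attention_head_entropy(attn: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn: (batch, num_heads, query_len, key_len) attention probabilities,
    where each row sums to 1. Returns mean Shannon entropy per head, shape (num_heads,)."""
    # H = -sum_j p_j * log(p_j), computed over the key dimension
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (batch, num_heads, query_len)
    return entropy.mean(dim=(0, 2))                      # average over batch and queries

# Interpretation: values near 0 indicate "entropy collapse" (a head attends to a
# single token); values near log(key_len) indicate "entropic overload"
# (near-uniform attention that under-uses the head's representational capacity).
```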
We propose an entropy-guided attention mechanism paired with a novel entropy
regularization technique to mitigate entropic overload. Additionally, we
explore PI-friendly alternatives to layer normalization for preventing entropy
collapse and stabilizing the training of LLMs with reduced nonlinearities. Our
study bridges the gap between information theory and architectural design,
establishing entropy dynamics as a principled guide for developing efficient PI
architectures. The code and implementation are available at
https://github.com/Nandan91/entropy-guided-attention-llm.
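For intuition about the entropy-guided regularization mentioned above, here is a hedged sketch of one way such a penalty could be attached to the training loss. The paper's actual regularizer may differ; reg_strength, the 0.9 threshold, and all names below are illustrative assumptions.

```python
# Hedged sketch of an entropy-based regularizer for mitigating entropic overload.
# Not the paper's exact formulation; parameters and names are assumptions.
import math
import torch

def entropic_overload_penalty(attn: torch.Tensor,
                              reg_strength: float = 1e-2,
                              eps: float = 1e-9) -> torch.Tensor:
    """Penalize heads whose attention entropy approaches the uniform maximum."""
    key_len = attn.size(-1)
    max_entropy = math.log(key_len)                       # entropy of a uniform row
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)    # (batch, heads, queries)
    excess = torch.relu(entropy - 0.9 * max_entropy)      # only near-uniform heads pay
    return reg_strength * excess.mean()

# Assumed usage inside a training step:
#   loss = lm_loss + entropic_overload_penalty(attn_probs)
```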