Entropy-Guided Attention for Private LLMs
January 7, 2025
Authors: Nandan Kumar Jha, Brandon Reagen
cs.AI
Abstract
The pervasiveness of proprietary language models has raised critical privacy
concerns, necessitating advancements in private inference (PI), where
computations are performed directly on encrypted data without revealing users'
sensitive information. While PI offers a promising solution, its practical
deployment is hindered by substantial communication and latency overheads,
primarily stemming from nonlinear operations. To address this, we introduce an
information-theoretic framework to characterize the role of nonlinearities in
decoder-only language models, laying a principled foundation for optimizing
transformer architectures tailored to the demands of PI.
By leveraging Shannon's entropy as a quantitative measure, we uncover the
previously unexplored dual significance of nonlinearities: beyond ensuring
training stability, they are crucial for maintaining attention head diversity.
Specifically, we find that their removal triggers two critical failure modes:
"entropy collapse" in deeper layers, which destabilizes training, and
"entropic overload" in earlier layers, which leads to under-utilization of
the representational capacity of Multi-Head Attention (MHA).
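To make the entropy measure concrete, the following is a minimal sketch of how the Shannon entropy of one attention head could be computed from its raw scores. It is an illustration only, not the authors' implementation: the function names, the toy scores, and the choice to average over query rows are all assumptions, and causal masking is omitted for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def attention_entropy(score_rows):
    """Mean entropy of one head's attention rows (one row per query).

    Averaging over queries is an assumption made for this sketch.
    """
    entropies = [shannon_entropy(softmax(row)) for row in score_rows]
    return sum(entropies) / len(entropies)

# Hypothetical raw scores for two heads, each with 2 queries over 4 keys.
peaked_head = [[20.0, 0.0, 0.0, 0.0], [0.0, 20.0, 0.0, 0.0]]
uniform_head = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

max_entropy = math.log(4)               # upper bound for 4 keys
low = attention_entropy(peaked_head)    # near 0: rows are almost one-hot
high = attention_entropy(uniform_head)  # equals log(4): rows are uniform
```

In this reading, a head whose entropy sits near zero attends to almost a single position, while a head near `max_entropy` spreads attention almost uniformly. These two extremes loosely correspond to the collapse and overload regimes named above.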
We propose an entropy-guided attention mechanism paired with a novel entropy
regularization technique to mitigate entropic overload. Additionally, we
explore PI-friendly alternatives to layer normalization for preventing entropy
collapse and stabilizing the training of LLMs with reduced nonlinearities. Our
study bridges the gap between information theory and architectural design,
establishing entropy dynamics as a principled guide for developing efficient PI
architectures. The code and implementation are available at
https://github.com/Nandan91/entropy-guided-attention-llm.
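The abstract does not spell out the form of the entropy regularizer, so the sketch below shows only one plausible shape such a penalty could take: heads whose entropy exceeds a fixed fraction of the maximum are penalized quadratically. The function name `entropy_overload_penalty`, the `threshold_frac` parameter, and the squared-hinge form are all hypothetical choices for illustration and should not be read as the method in the paper or its repository.

```python
import math

def entropy_overload_penalty(head_entropies, num_keys, threshold_frac=0.7):
    """Hypothetical penalty on heads with overly high attention entropy.

    head_entropies: per-head entropies in nats.
    num_keys: number of attended positions; log(num_keys) is the maximum entropy.
    threshold_frac: assumed fraction of the maximum above which a head is penalized.
    """
    max_entropy = math.log(num_keys)
    threshold = threshold_frac * max_entropy
    # Squared hinge: heads at or below the threshold contribute nothing.
    excess = [max(0.0, h - threshold) for h in head_entropies]
    return sum(e * e for e in excess) / len(excess)

# Three hypothetical heads over 4 keys; only entropies above the threshold add to the penalty.
penalty = entropy_overload_penalty([0.2, 1.0, 1.38], num_keys=4)
```

In a training loop, a term of this kind would typically be scaled by a small coefficient and added to the language-modeling loss. The coefficient and threshold would need tuning, and both are assumptions here.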