Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
December 16, 2024
Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
cs.AI
Abstract
Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), raising the question of how this ability arises. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "finding the first noun in a sentence") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to, and predictive of, ICL performance. Our empirical insights shed light on the success and failure modes of large language models through their representations.
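The abstract's key quantity, how separably a model encodes latent concepts, is commonly measured with a linear probe on hidden activations. The following is a minimal, hypothetical sketch of that idea, not the authors' implementation: the names (hidden_states, concept_labels) and the synthetic placeholder data are assumptions, and in practice the activations would be extracted from a transformer layer on ICL prompts.

# Illustrative sketch (not the paper's code): quantifying "concept encoding"
# as the linear separability of hidden representations by latent concept.
# The data below is a synthetic stand-in for real transformer activations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_prompts, d_model, n_concepts = 600, 256, 4

# Hypothetical stand-in for layer activations at the final prompt token,
# with a concept-dependent mean shift to mimic separable encodings.
concept_labels = rng.integers(0, n_concepts, size=n_prompts)
concept_means = rng.normal(0.0, 1.0, size=(n_concepts, d_model))
hidden_states = concept_means[concept_labels] + rng.normal(0.0, 2.0, size=(n_prompts, d_model))

# Cross-validated probe accuracy serves as a proxy for encoding quality:
# higher accuracy means more distinct, separable concept representations.
probe = LogisticRegression(max_iter=2000)
acc = cross_val_score(probe, hidden_states, concept_labels, cv=5).mean()
print(f"concept decodability (5-fold probe accuracy): {acc:.3f}")

Under this framing, tracking probe accuracy over training checkpoints or model scales would reveal the kind of coupling between representation separability and ICL performance that the paper reports.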