Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers

December 16, 2024
Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal
cs.AI

Abstract

Humans distill complex experiences into fundamental abstractions that enable rapid learning and adaptation. Similarly, autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which raises the question of how they do so. In this paper, we propose a concept encoding-decoding mechanism to explain ICL by studying how transformers form and use internal abstractions in their representations. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of concept encoding and decoding. As the model learns to encode different latent concepts (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate the existence of this mechanism across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B). Further, through mechanistic interventions and controlled finetuning, we demonstrate that the quality of concept encoding is causally related to, and predictive of, ICL performance. Our empirical insights shed light on the success and failure modes of large language models via their representations.
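The abstract describes measuring how "distinct and separable" a model's concept encodings are. A minimal sketch of one common way to quantify this, not the authors' method, is to probe hidden states with a simple classifier: if a lightweight probe can recover the latent concept from the representations, the encodings are separable. The hidden states below are simulated Gaussian clusters (a stand-in assumption; in practice they would be extracted from a transformer's residual stream), and a nearest-centroid probe serves as the separability metric.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 200  # hypothetical hidden-state dimension and samples per concept

# Simulated hidden states: two latent concepts with distinct mean directions.
# (In a real experiment these would come from model activations on ICL prompts.)
mu_a, mu_b = rng.normal(size=dim), rng.normal(size=dim)
states_a = mu_a + 0.5 * rng.normal(size=(n, dim))
states_b = mu_b + 0.5 * rng.normal(size=(n, dim))

X = np.vstack([states_a, states_b])
y = np.array([0] * n + [1] * n)

# Nearest-centroid probe: assign each state to the closer concept centroid.
centroids = np.stack([X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)])
dists = ((X[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
pred = np.argmin(dists, axis=1)
accuracy = (pred == y).mean()
print(f"probe accuracy: {accuracy:.2f}")  # high accuracy => separable encodings
```

High probe accuracy on activations from different in-context tasks would indicate the separable concept representations that the paper reports as co-emerging with ICL performance.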
