Large Language Models as Markov Chains

Abstract

Large language models (LLMs) have proven to be remarkably efficient, both across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the origins of their impressive performance remains elusive. In this paper, we approach this challenging task by drawing an equivalence between generic autoregressive language models with vocabulary of size T and context window of size K and Markov chains defined on a finite state space of size O(T^K). We derive several surprising findings related to the existence of a stationary distribution of Markov chains that capture the inference power of LLMs, their speed of convergence to it, and the influence of the temperature on the latter. We then prove pre-training and in-context generalization bounds and show how the drawn equivalence allows us to enrich their interpretation. Finally, we illustrate our theoretical guarantees with experiments on several recent LLMs to highlight how they capture the behavior observed in practice.
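To make the stated equivalence concrete, the following is a minimal sketch, not the paper's implementation. It assumes one natural construction: the states are all token sequences of length 1 to K, and a transition appends the sampled token, dropping the oldest token once the context is full. This gives T + T^2 + ... + T^K = O(T^K) states. The function `toy_logits` is a hypothetical stand-in for a trained model, and all other names are illustrative.

```python
import itertools
import math


def toy_logits(context, vocab_size):
    # Hypothetical stand-in for a trained model: any deterministic map
    # from a context (tuple of token ids) to one logit per token would do.
    offset = sum((i + 1) * c for i, c in enumerate(context))
    return [math.sin(1.0 + token + offset) for token in range(vocab_size)]


def next_token_probs(context, vocab_size, temperature=1.0):
    # Temperature-scaled softmax over the toy logits (numerically stable).
    scaled = [logit / temperature for logit in toy_logits(context, vocab_size)]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def build_states(vocab_size, context_size):
    # Assumed state space: every sequence of length 1..K over the vocabulary.
    states = []
    for length in range(1, context_size + 1):
        states.extend(itertools.product(range(vocab_size), repeat=length))
    return states


def build_transition_matrix(vocab_size, context_size, temperature=1.0):
    states = build_states(vocab_size, context_size)
    index = {state: i for i, state in enumerate(states)}
    matrix = [[0.0] * len(states) for _ in states]
    for state in states:
        probs = next_token_probs(state, vocab_size, temperature)
        for token, prob in enumerate(probs):
            nxt = state + (token,)
            if len(nxt) > context_size:
                nxt = nxt[1:]  # context window is full: drop the oldest token
            matrix[index[state]][index[nxt]] += prob
    return states, matrix


def stationary_distribution(matrix, num_steps=2000):
    # Power iteration on a row vector, starting from the uniform distribution.
    n = len(matrix)
    dist = [1.0 / n] * n
    for _ in range(num_steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return dist


if __name__ == "__main__":
    states, matrix = build_transition_matrix(vocab_size=3, context_size=2)
    pi = stationary_distribution(matrix)
    for state, mass in zip(states, pi):
        print(state, round(mass, 4))
```

In this toy setting every next-token probability is strictly positive, so the chain restricted to full-length contexts has a unique stationary distribution, and shorter contexts are transient and receive no stationary mass. Rebuilding the matrix with a different `temperature` and rerunning the power iteration for fewer steps is one way to explore how temperature affects the speed of convergence.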
