In-context learning and Occam's razor
October 17, 2024
Authors: Eric Elmoznino, Tom Marty, Tejas Kasetty, Leo Gagnon, Sarthak Mittal, Mahan Fathi, Dhanya Sridhar, Guillaume Lajoie
cs.AI
Abstract
The goal of machine learning is generalization. While the No Free Lunch
Theorem states that we cannot obtain theoretical guarantees for generalization
without further assumptions, in practice we observe that simple models which
explain the training data generalize best: a principle called Occam's razor.
Despite the need for simple models, most current approaches in machine learning
only minimize the training error, and at best indirectly promote simplicity
through regularization or architecture design. Here, we draw a connection
between Occam's razor and in-context learning: an emergent ability of certain
sequence models like Transformers to learn at inference time from past
observations in a sequence. In particular, we show that the next-token
prediction loss used to train in-context learners is directly equivalent to a
data compression technique called prequential coding, and that minimizing this
loss amounts to jointly minimizing both the training error and the complexity
of the model that was implicitly learned from context. Our theory and the
empirical experiments we use to support it not only provide a normative account
of in-context learning, but also elucidate the shortcomings of current
in-context learning methods, suggesting ways in which they can be improved. We
make our code available at https://github.com/3rdCore/PrequentialCode.
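To make the equivalence in the abstract concrete, here is a minimal sketch (ours, not the authors' code; the `predict` interface is a hypothetical stand-in for any causal sequence model) of prequential coding. Under arithmetic coding, transmitting token x_t with a model that assigns it probability p(x_t | x_{<t}) costs -log p(x_t | x_{<t}) nats, so the total code length is exactly the summed next-token prediction loss; in the MDL view the paper draws on, minimizing it jointly penalizes training error and the complexity of the implicitly learned model.

```python
# Minimal sketch, not the paper's implementation: prequential code length
# of a token sequence under a next-token predictor.
import math
from typing import Callable, Sequence

def prequential_code_length(
    predict: Callable[[Sequence[int]], Sequence[float]],
    tokens: Sequence[int],
) -> float:
    """Nats needed to transmit `tokens` one at a time, where predict(prefix)
    returns a probability distribution over the next token (a hypothetical
    interface standing in for any causal sequence model, e.g. a Transformer).
    The sum below is identical to the unnormalized next-token prediction loss."""
    total_nats = 0.0
    for t in range(len(tokens)):
        probs = predict(tokens[:t])            # model conditioned on the prefix
        total_nats += -math.log(probs[tokens[t]])
    return total_nats

# A predictor that never learns from context (uniform over a vocabulary of
# size V) compresses nothing: its code length is the baseline T * log(V).
V = 4
uniform = lambda prefix: [1.0 / V] * V
assert abs(prequential_code_length(uniform, [0, 1, 2, 3]) - 4 * math.log(V)) < 1e-9
```

A model that learns from context assigns higher probability to later tokens after seeing earlier ones, so a short prequential code reflects both a good fit to the sequence and a simple implicit model.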