

L^2M: Mutual Information Scaling Law for Long-Context Language Modeling

March 6, 2025
Authors: Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
cs.AI

Abstract

We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L^2M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
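
As a rough guide to the quantities the abstract refers to (the notation below is illustrative, not taken from the paper): writing $X_{1:L}$ for a length-$L$ text, the bipartite mutual information is measured between two adjacent blocks of the text, whereas the conventional two-point mutual information is measured between two individual tokens at a fixed separation:

\[
I_{\mathrm{bp}}(L) = I\!\left(X_{1:L/2};\, X_{L/2+1:L}\right),
\qquad
I_{\mathrm{tp}}(d) = I\!\left(X_i;\, X_{i+d}\right).
\]

The L^2M condition described in the abstract then ties a model's ability to handle context length $L$ to how the size of its latent state (the memory it keeps of the past tokens) grows relative to $I_{\mathrm{bp}}(L)$; the precise statement and the measured scaling exponents are given in the paper.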
