It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
April 17, 2025
Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni
cs.AI
Abstract
Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias, the natural tendency to prioritize certain events or stimuli, we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys to values using an internal objective, referred to as attentional bias. Surprisingly, we observe that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with effective approximations that stabilize their training. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building on these insights, we present Miras, a general framework for designing deep learning architectures based on four choices: (i) the associative memory architecture, (ii) the attentional bias objective, (iii) the retention gate, and (iv) the memory learning algorithm. We present three novel sequence models, Moneta, Yaad, and Memora, that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance on specific tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.