PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference

February 19, 2025
Author: Burc Gokden
cs.AI

Abstract

We show that the Large Language Model from Power Law Decoder Representations (PLDR-LLM) is a foundational model whose deductive outputs are invariant tensors up to a small perturbation. PLDR-LLM learns a singularity condition for the deductive outputs that enables the once-inferred energy-curvature tensor G_{LM} to replace the power law graph attention (PLGA) deep neural network that generates the deductive outputs at inference. We demonstrate that a cache for G_{LM} (G-cache) and a KV-cache can be implemented in a straightforward manner to improve inference time. The invariance and generalizable nature of the deductive outputs hold to very high fidelity: after caching, the deductive outputs have the same RMSE and determinant values up to 15 decimal places, and zero-shot benchmark scores remain unchanged. Ablation studies show that the learned deductive outputs have loss and accuracy characteristics distinct from those of models pretrained with transferred, randomly initialized, or identity tensors as a constant tensor operator, and that an LLM with scaled dot-product attention (SDPA) is a special case of PLDR-LLM in which G_{LM} is predefined as the identity tensor. The observed invariance characteristic introduces a novel asymmetry between the training phase and the inference-with-caching phase. We outline common characteristics observed in the deductive outputs under the learned singularity condition, and we provide an implementation of a training and inference framework for PLDR-LLM with KV-cache and G-cache.
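To make the caching idea concrete, here is a minimal, hypothetical PyTorch sketch, not the paper's released implementation: the class name CachedPLGAttentionSketch, the infer_g placeholder, and the q G k^T score form are illustrative assumptions. It shows how a once-inferred G_{LM} can be cached alongside the keys and values, and how fixing G_{LM} to the identity reduces the mechanism to ordinary SDPA, as in the special case noted in the abstract.

```python
# Minimal, hypothetical sketch (assumed names and score form; NOT the paper's
# released implementation): single-head attention where a tensor operator
# G_LM modulates the query-key scores and is cached alongside keys/values.
import torch
import torch.nn.functional as F


class CachedPLGAttentionSketch(torch.nn.Module):
    """With G_LM fixed to the identity, this reduces to plain scaled
    dot-product attention (SDPA), mirroring the special case above."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d = d_model
        self.wq = torch.nn.Linear(d_model, d_model, bias=False)
        self.wk = torch.nn.Linear(d_model, d_model, bias=False)
        self.wv = torch.nn.Linear(d_model, d_model, bias=False)
        self.k_cache = None  # KV-cache: keys from previous steps
        self.v_cache = None  # KV-cache: values from previous steps
        self.g_cache = None  # G-cache: the once-inferred tensor operator G_LM

    def infer_g(self, q: torch.Tensor) -> torch.Tensor:
        # Stand-in for the PLGA deep neural network that would produce G_LM;
        # identity is returned here only so the sketch runs. The paper's point
        # is that once G_LM is inferred, this network can be skipped at
        # inference and the cached tensor reused instead.
        return torch.eye(self.d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model) new tokens for this decoding step.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        # Standard KV-caching: append this step's keys/values to the cache.
        self.k_cache = k if self.k_cache is None else torch.cat([self.k_cache, k])
        self.v_cache = v if self.v_cache is None else torch.cat([self.v_cache, v])
        # G-caching: run the G-producing network only once, then reuse.
        if self.g_cache is None:
            self.g_cache = self.infer_g(q)
        # Scores take the form q @ G @ k^T (an assumption for illustration);
        # with G = I this is exactly SDPA. Causal masking is omitted here.
        scores = (q @ self.g_cache @ self.k_cache.T) / self.d ** 0.5
        return F.softmax(scores, dim=-1) @ self.v_cache


attn = CachedPLGAttentionSketch(d_model=8)
out = attn(torch.randn(4, 8))  # prompt: G_LM is inferred and cached here
out = attn(torch.randn(1, 8))  # next step: reuses both G-cache and KV-cache
```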
