Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
February 18, 2025
Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
cs.AI
Abstract
A range of recent works addresses the problem of compressing a sequence of
tokens into a shorter sequence of real-valued vectors that can be used as
inputs in place of token embeddings or a key-value cache. These approaches
make it possible to reduce the amount of compute in existing language models.
Despite relying on powerful models as encoders, the maximum attainable
lossless compression ratio is typically no higher than x10. This is highly
intriguing because, in theory, the maximum information capacity of large
real-valued vectors far exceeds the demonstrated rates, even at 16-bit
precision and with a modest vector size. In this work, we explore the limits
of compression by replacing the encoder with a per-sample optimization
procedure. We show that vectors with compression ratios of up to x1500 exist,
revealing a two-order-of-magnitude gap between existing and practically
attainable solutions. Furthermore, we show empirically that the compression
limit is determined not by the length of the input but by the amount of
uncertainty to be reduced, namely, the cross-entropy loss on the sequence
without any conditioning. The obtained limits highlight the substantial gap
between the theoretical capacity of input embeddings and their practical
utilization, suggesting significant room for optimization in model design.
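
The core idea, replacing a trained encoder with direct per-sample optimization of the input vectors, is close in spirit to prompt tuning: one or a few trainable embeddings are prepended to a frozen language model and fit by gradient descent until the model can reproduce the target sequence. For intuition on the capacity argument, a single d-dimensional vector stored in 16-bit precision occupies 16d bits (e.g., 65,536 bits at an assumed d = 4096), far more than the cross-entropy of most short natural-language sequences. The sketch below illustrates the per-sample setup; it is not the authors' code, and the model name ("gpt2"), learning rate, step budget, and stopping criterion are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch: compress a token sequence into ONE trainable input vector
# by per-sample gradient descent against a frozen causal LM (no encoder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any frozen causal LM could be used
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # the model stays frozen; only the vector is trained

text = "The quick brown fox jumps over the lazy dog."  # illustrative sequence to compress
ids = tok(text, return_tensors="pt").input_ids           # (1, T)
target_emb = model.get_input_embeddings()(ids)           # (1, T, d) frozen token embeddings

# A single trainable "memory" vector; using more vectors raises capacity.
mem = torch.nn.Parameter(torch.randn(1, 1, model.config.n_embd) * 0.02)
opt = torch.optim.Adam([mem], lr=1e-2)  # illustrative hyperparameters

for step in range(5000):
    inputs = torch.cat([mem, target_emb], dim=1)          # [mem] + token embeddings
    logits = model(inputs_embeds=inputs).logits           # (1, T+1, V)
    # Position 0 (the memory vector) predicts token 0, position i predicts token i.
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)), ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() < 1e-3:  # proxy for lossless reconstruction; a strict check decodes greedily from mem
        break
```

In this framing, compression is "lossless" when greedy decoding conditioned on the optimized vector reproduces the original tokens exactly, and the attainable ratio is governed by how much cross-entropy the sequence carries rather than by its raw length.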