Adaptive Length Image Tokenization via Recurrent Allocation
November 4, 2024
Authors: Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman
cs.AI
Abstract
Current vision systems typically assign fixed-length representations to
images, regardless of the information content. This contrasts with human
intelligence - and even large language models - which allocate varying
representational capacities based on entropy, context and familiarity. Inspired
by this, we propose an approach to learn variable-length token representations
for 2D images. Our encoder-decoder architecture recursively processes 2D image
tokens, distilling them into 1D latent tokens over multiple iterations of
recurrent rollouts. Each iteration refines the 2D tokens, updates the existing
1D latent tokens, and adaptively increases representational capacity by adding
new tokens. This enables compression of images into a variable number of
tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction
loss and FID metrics, demonstrating that token count aligns with image entropy,
familiarity and downstream task requirements. Recurrent token processing with
increasing representational capacity in each iteration shows signs of token
specialization, revealing potential for object/part discovery.
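The recurrent allocation loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the learned encoder-decoder transformers are replaced by a stand-in least-squares reconstruction, and all names (`adaptive_tokenize`, `tokens_per_iter`, `recon_threshold`) are hypothetical.

```python
import numpy as np

def adaptive_tokenize(image_tokens, tokens_per_iter=32, max_tokens=256,
                      recon_threshold=0.05, seed=0):
    """Hypothetical sketch of recurrent adaptive-length tokenization.

    Each iteration appends `tokens_per_iter` fresh 1D latent tokens
    (adaptive capacity growth), jointly "refines" all latents against
    the 2D image tokens (here a stand-in least-squares fit, not the
    paper's learned transformer), and stops once reconstruction error
    falls below `recon_threshold` or `max_tokens` is reached. Simple
    images thus get few tokens; complex ones get more, up to 256.
    """
    rng = np.random.default_rng(seed)
    d = image_tokens.shape[1]          # feature dimension of each token
    latents = np.empty((0, d))         # start with zero 1D latent tokens
    err = np.inf
    while latents.shape[0] < max_tokens:
        # add a fresh block of latent tokens
        new = rng.normal(size=(tokens_per_iter, d))
        latents = np.vstack([latents, new])
        # stand-in refinement: reconstruct each 2D token as a linear
        # combination of the current latents, via least squares
        coeffs, *_ = np.linalg.lstsq(latents.T, image_tokens.T, rcond=None)
        recon = (latents.T @ coeffs).T
        err = float(np.mean((recon - image_tokens) ** 2))
        if err < recon_threshold:      # enough capacity for this image
            break
    return latents, err

# Usage: 196 image tokens (a 14x14 grid) with 16-dim features
img = np.random.default_rng(1).normal(size=(196, 16))
latents, err = adaptive_tokenize(img)
```

The stopping rule is where image-dependent behavior enters: the token count grows in blocks of 32 until the reconstruction criterion is met, mirroring the 32-to-256 range reported in the abstract.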