Adaptive Length Image Tokenization via Recurrent Allocation
November 4, 2024
Authors: Shivam Duggal, Phillip Isola, Antonio Torralba, William T. Freeman
cs.AI
Abstract
Current vision systems typically assign fixed-length representations to
images, regardless of the information content. This contrasts with human
intelligence - and even large language models - which allocate varying
representational capacities based on entropy, context and familiarity. Inspired
by this, we propose an approach to learn variable-length token representations
for 2D images. Our encoder-decoder architecture recursively processes 2D image
tokens, distilling them into 1D latent tokens over multiple iterations of
recurrent rollouts. Each iteration refines the 2D tokens, updates the existing
1D latent tokens, and adaptively increases representational capacity by adding
new tokens. This enables compression of images into a variable number of
tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction
loss and FID metrics, demonstrating that token count aligns with image entropy,
familiarity and downstream task requirements. Recurrent token processing with
increasing representational capacity in each iteration shows signs of token
specialization, revealing potential for object/part discovery.
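The recurrent allocation loop described in the abstract can be illustrated with a toy sketch. This is not the paper's model: a least-squares projection stands in for the learned encoder-decoder, the `adaptive_tokenize` function and its stopping threshold `tol` are illustrative assumptions, and only the overall pattern — grow the 1D latent set in 32-token increments until reconstruction is good enough, capped at 256 — follows the abstract.

```python
import numpy as np

def adaptive_tokenize(image_tokens, base=32, max_tokens=256, tol=0.05, rng=None):
    """Toy sketch of recurrent token allocation (illustrative, not the paper's model).

    Each iteration "distills" the 2D image tokens into a growing set of 1D
    latent tokens via a least-squares projection, appending `base` new tokens
    per iteration until reconstruction error drops below `tol` (an assumed
    stopping rule) or `max_tokens` is reached.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = image_tokens.shape
    latents = None
    err = np.inf
    for k in range(base, max_tokens + base, base):
        # Adaptively increase capacity: append `base` fresh latent tokens.
        new = rng.standard_normal((base, d)) * 0.01
        latents = new if latents is None else np.vstack([latents, new])
        # "Refine": least-squares fit of image tokens onto the latent basis,
        # standing in for the learned recurrent encoder-decoder update.
        coef, *_ = np.linalg.lstsq(latents.T, image_tokens.T, rcond=None)
        recon = (latents.T @ coef).T
        err = np.mean((recon - image_tokens) ** 2)
        if err < tol:  # easy images stop early with fewer tokens
            break
    return latents, err

# A random 256-token, 64-dim "image": the loop stops once the latent
# basis spans the token space, using fewer than the 256-token maximum.
tokens = np.random.default_rng(1).standard_normal((256, 64))
latents, err = adaptive_tokenize(tokens)
print(latents.shape[0], err)
```

Under this stand-in, token count tracks how hard the tokens are to reconstruct, mirroring the abstract's claim that allocated capacity should scale with image content rather than being fixed.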