Adaptive Length Image Tokenization via Recurrent Allocation

Abstract

Current vision systems typically assign fixed-length representations to images, regardless of their information content. This contrasts with human intelligence, and even with large language models, which allocate varying representational capacity based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. We validate our tokenizer using reconstruction loss and FID metrics, demonstrating that token count aligns with image entropy, familiarity, and downstream task requirements. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object/part discovery.
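
The recurrent rollout described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the step of 32 new tokens per iteration is inferred only from the stated 32 to 256 range, the names `recurrent_rollout`, `mix`, and `mean_vector` are hypothetical, and a simple averaging update stands in for the learned encoder.

```python
import random


def mean_vector(tokens):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def mix(image_tokens, latents, rate=0.5):
    """Placeholder for the shared encoder (assumption, not the real model).

    Latent tokens move toward the mean image token, and image tokens move
    toward the mean latent token, so both sets are updated in each call.
    """
    img_mean = mean_vector(image_tokens)
    lat_mean = mean_vector(latents)
    new_latents = [
        [(1 - rate) * x + rate * m for x, m in zip(t, img_mean)] for t in latents
    ]
    new_image = [
        [(1 - rate) * x + rate * m for x, m in zip(t, lat_mean)] for t in image_tokens
    ]
    return new_image, new_latents


def recurrent_rollout(image_tokens, step=32, max_tokens=256, seed=0):
    """Grow the 1D latent token list by `step` tokens per iteration.

    Returns one snapshot of the latent tokens per iteration, so a caller
    can pick the representation length it needs.
    """
    rng = random.Random(seed)
    dim = len(image_tokens[0])
    latents = []
    snapshots = []
    while len(latents) < max_tokens:
        # Increase capacity: append freshly initialised latent tokens.
        new_tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(step)]
        latents = latents + new_tokens
        # Refine the 2D tokens and update all 1D latent tokens (old and new).
        image_tokens, latents = mix(image_tokens, latents)
        snapshots.append([list(t) for t in latents])
    return snapshots


# Four stand-in "2D image tokens" of dimension 3.
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
snapshots = recurrent_rollout(tokens)
token_counts = [len(s) for s in snapshots]
```

Each snapshot is a candidate representation of a different length. In a real system, a downstream criterion such as a reconstruction-loss threshold would decide which length to keep for a given image.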

