"Principal Components" Enable A New Language of Images

March 11, 2025
Authors: Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi
cs.AI

Abstract

We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space, a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures that the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identify a semantic-spectrum coupling effect, which causes unwanted entanglement of high-level semantic content and low-level spectral details in the tokens, and we resolve it by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and offers interpretability that aligns better with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.
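
For intuition about the PCA analogy in the abstract, the sketch below runs classical PCA on synthetic data. It is not the paper's tokenizer. It only illustrates the two properties the abstract borrows from PCA: components come out in order of non-increasing explained variance, and reconstructing from a longer prefix of components never increases the error. The synthetic data, the pure-Python power-iteration solver, and all function names (`pca`, `prefix_error`, and so on) are assumptions made for this illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def centre(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(mat, iters=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    d = len(mat)
    v = [float(i + 1) for i in range(d)]  # deterministic, non-symmetric start vector
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [dot(row, v) for row in mat]
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return dot(v, [dot(row, v) for row in mat]), v

def pca(rows, k):
    """Top-k (variance, direction) pairs, found one at a time with deflation."""
    mat = covariance(rows)
    d = len(mat)
    pairs = []
    for _ in range(k):
        lam, v = top_eigenpair(mat)
        pairs.append((lam, v))
        # Remove the component just found so the next pass finds the next largest.
        mat = [[mat[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs

def prefix_error(rows, directions, k):
    """Mean squared reconstruction error using only the first k directions."""
    total = 0.0
    for r in rows:
        recon = [0.0] * len(r)
        for v in directions[:k]:
            c = dot(r, v)
            recon = [x + c * vi for x, vi in zip(recon, v)]
        total += sum((a - b) ** 2 for a, b in zip(r, recon))
    return total / len(rows)

# Synthetic 4-D data: independent sources with different scales, linearly mixed.
random.seed(0)
scales = [3.0, 1.5, 0.7, 0.2]
data = []
for _ in range(400):
    z = [random.gauss(0.0, s) for s in scales]
    data.append([z[0] + z[1], z[0] - z[1], z[2] + z[3], z[2] - z[3]])

rows = centre(data)
pairs = pca(rows, k=4)
variances = [lam for lam, _ in pairs]
directions = [v for _, v in pairs]
errors = [prefix_error(rows, directions, k) for k in range(5)]

print("explained variance per component:", [round(x, 3) for x in variances])
print("error using the first k components:", [round(e, 3) for e in errors])
```

Each printed list should be non-increasing: the first component explains the most variance, and every additional component in the prefix removes a smaller share of the remaining error. The abstract describes imposing a comparable ordering on learned image tokens.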
