PixelWorld: Towards Perceiving Everything as Pixels
January 31, 2025
Authors: Zhiheng Lyu, Xueguang Ma, Wenhu Chen
cs.AI
Abstract
Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e., "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all of these modalities into pixel space to gauge existing models' performance. Our findings show that (1) PEAP outperforms token-based baselines on multimodal datasets, benefiting from unified input for better disambiguation; (2) all models show significant declines in reasoning and coding capabilities when processing pixel-based input, underscoring the need to enhance foundation models' perceptual abilities; (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation; (4) the attention patterns of PEAP are highly aligned with those of text-token input; (5) PEAP can be accelerated significantly by exploiting spatial sparsity. We conclude that existing frontier models are competent at pixel perception; however, there is still headroom for improvement. Our code and dataset will be released upon acceptance.
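To make the PEAP setup concrete, below is a minimal sketch, assuming a Pillow/NumPy environment, of rendering a text prompt into pixels and of the kind of blank-patch pruning that finding (5) alludes to. The function names (`render_text_as_pixels`, `prune_blank_patches`) and all layout and threshold parameters are illustrative assumptions, not the authors' released pipeline.

```python
# A minimal sketch of the PEAP idea: rasterize a text prompt into an RGB
# image that a vision-language model could consume as pixels instead of
# tokens. All names and parameters here are illustrative assumptions.
import textwrap

import numpy as np
from PIL import Image, ImageDraw, ImageFont


def render_text_as_pixels(text: str, width: int = 896, font_size: int = 16) -> Image.Image:
    """Rasterize `text` onto a white canvas, wrapping lines to fit `width`."""
    font = ImageFont.load_default()  # fixed-size bitmap font; swap in a TTF for realism
    chars_per_line = max(1, width // (font_size // 2))  # rough monospace estimate
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    line_height = font_size + 4  # font_size drives spacing, not the bitmap font itself
    img = Image.new("RGB", (width, line_height * len(lines) + 8), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((4, 4 + i * line_height), line, fill="black", font=font)
    return img


def prune_blank_patches(img: Image.Image, patch: int = 16, thresh: float = 0.99) -> list[tuple[int, int]]:
    """Toy version of the spatial-sparsity speedup (finding 5): keep only
    patches that contain ink, so a vision encoder processes fewer tokens."""
    gray = np.asarray(img.convert("L"), dtype=np.float32) / 255.0
    h, w = gray.shape
    kept = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if gray[y:y + patch, x:x + patch].mean() < thresh:  # not near-white
                kept.append((y, x))
    return kept


if __name__ == "__main__":
    img = render_text_as_pixels("What is 17 * 24? Show your reasoning step by step.")
    img.save("peap_input.png")  # feed this image to a VLM in place of text tokens
    print(f"non-blank patches kept: {len(prune_blank_patches(img))}")
```

Since rendered text leaves most of the canvas white, pruning near-blank patches before encoding is a natural way to cut vision-encoder cost, which is the intuition behind the reported acceleration.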