PixelWorld: Towards Perceiving Everything as Pixels

Abstract

Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner. With the rise of embodied and agentic AI, where inputs primarily come from camera pixels, the need for a unified perception framework becomes increasingly evident. In this paper, we propose to unify all modalities (text, tables, code, diagrams, images, etc.) as pixel inputs, i.e., "Perceive Everything as Pixels" (PEAP). We introduce PixelWorld, a novel evaluation suite that unifies all the mentioned modalities into pixel space to gauge existing models' performance. Our findings show that (1) PEAP outperforms the token-based baseline on multimodal datasets, benefiting from unified input for better disambiguation; (2) reasoning and coding capabilities decline significantly across all models when processing pixel-based input, underscoring the need to enhance foundation models' perceptual abilities; (3) larger models can maintain strong performance on non-reasoning tasks under PEAP, while smaller models like Phi-3.5-V suffer significant performance degradation; (4) the attention pattern of PEAP is highly aligned with that of text token input; and (5) PEAP can be accelerated significantly by exploiting spatial sparsity. We conclude that existing frontier models are competent in pixel perception, but there is still headroom for improvement. Our code and dataset will be released upon acceptance.
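
As a rough illustration of what exploiting spatial sparsity could mean for pixel inputs, the sketch below splits a rendered page into patches and drops those that contain only background. This is an assumption made for illustration, not the procedure used in the paper: the function names, the patch size, and the all-white background test are all hypothetical, and the image is a plain list of grayscale rows so the example needs no external libraries.

```python
# Hypothetical sketch: skip blank patches of a rendered page before encoding.
# The image is a list of rows of grayscale values (0 = black, 255 = white).
# Nothing here is taken from the paper's implementation.


def split_into_patches(image, patch):
    """Yield ((patch_row, patch_col), tile) for non-overlapping square tiles."""
    height, width = len(image), len(image[0])
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            yield (top // patch, left // patch), tile


def is_blank(tile, background=255):
    """A tile is blank when every pixel equals the background value."""
    return all(pixel == background for row in tile for pixel in row)


def keep_informative_patches(image, patch, background=255):
    """Return only the patches that contain at least one non-background pixel."""
    return [
        (position, tile)
        for position, tile in split_into_patches(image, patch)
        if not is_blank(tile, background)
    ]


def make_demo_image():
    """Build an 8x8 white image with a 2x2 dark block near the top-left corner."""
    image = [[255] * 8 for _ in range(8)]
    for r in (1, 2):
        for c in (1, 2):
            image[r][c] = 0
    return image


image = make_demo_image()
kept = keep_informative_patches(image, patch=4)
total = len(list(split_into_patches(image, 4)))
print(f"kept {len(kept)} of {total} patches")
```

In this toy page only the top-left patch carries any ink, so the other three would never need to reach the vision encoder. Rendered text and tables tend to have large blank margins, which is why a filter of this kind is a plausible source of speedup.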
