Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Abstract

Humans possess the visual-spatial intelligence to remember spaces from sequential visual observations. However, can Multimodal Large Language Models (MLLMs) trained on million-scale video datasets also "think in space" from videos? We present a novel video-based visual-spatial intelligence benchmark (VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit competitive, though subhuman, visual-spatial intelligence. We probe models to express how they think in space both linguistically and visually, and find that while spatial reasoning capabilities remain the primary bottleneck for MLLMs to reach higher benchmark performance, local world models and spatial awareness do emerge within these models. Notably, prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance, whereas explicitly generating cognitive maps during question-answering enhances MLLMs' spatial distance ability.


