Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
November 30, 2024
Authors: Duo Zheng, Shijia Huang, Liwei Wang
cs.AI
Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly impacted various multimodal tasks. However, these models face challenges in tasks that require spatial understanding within 3D environments. Efforts have been made to enhance MLLMs, such as by incorporating point cloud features, yet a considerable gap remains between the models' learned representations and the inherent complexity of 3D scenes. This discrepancy largely stems from MLLMs being trained predominantly on 2D data, which restricts their effectiveness in comprehending 3D spaces. To address this issue, we propose Video-3D LLM, a novel generalist model for 3D scene understanding. By treating 3D scenes as dynamic videos and incorporating 3D position encodings into their frame representations, our Video-3D LLM aligns video representations with real-world spatial context more accurately. Additionally, we implement a maximum coverage sampling technique to balance computational cost against performance. Extensive experiments demonstrate that our model achieves state-of-the-art performance on several 3D scene understanding benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D.
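The abstract's key mechanism, making video representations position-aware, can be illustrated concretely: each frame's pixels (or patches) are lifted to world coordinates using the frame's depth map and camera pose, and an encoding of those 3D coordinates is added to the visual features. The following is a minimal sketch under those assumptions; the helper names and the sinusoidal form of the encoding are illustrative choices, not the authors' released implementation.

```python
import torch

def backproject_to_world(depth, intrinsics, cam_to_world):
    """Lift each pixel of a depth map to 3D world coordinates.

    depth:        (H, W) metric depth for one video frame
    intrinsics:   (3, 3) camera intrinsic matrix
    cam_to_world: (4, 4) camera-to-world pose
    returns:      (H, W, 3) world-space XYZ per pixel
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pinhole back-projection into camera space, then homogeneous transform.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = torch.stack([x, y, depth, torch.ones_like(depth)], dim=-1)  # (H, W, 4)
    return (pts @ cam_to_world.T)[..., :3]

def sinusoidal_3d_encoding(xyz, dim):
    """Sinusoidal encoding of XYZ coordinates; dim must be divisible by 6."""
    d = dim // 6
    freqs = torch.exp(torch.arange(d) * (-torch.log(torch.tensor(1e4)) / d))
    parts = [f(xyz[..., i:i + 1] * freqs)
             for i in range(3) for f in (torch.sin, torch.cos)]
    return torch.cat(parts, dim=-1)  # (..., dim)

# Position-aware feature: add the coordinate encoding to the visual feature,
# e.g. using the mean 3D location of the pixels inside each visual patch.
# patch_feats = patch_feats + sinusoidal_3d_encoding(patch_xyz, patch_feats.shape[-1])
```

Because the positions come from frames the model already consumes, no separate 3D encoder (e.g., a point cloud backbone) is required; the spatial signal rides along with the ordinary video features.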
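Maximum coverage sampling is, by name, an instance of the classic maximum coverage problem, which is NP-hard but admits a simple greedy approximation with a (1 - 1/e) guarantee. A plausible reading, sketched below under assumptions (the paper's exact criterion may differ), is to precompute the set of scene voxels each frame observes, for instance by voxelizing its back-projected depth, and then greedily keep the frames that add the most unseen voxels.

```python
def max_coverage_sampling(frame_voxels, k):
    """Greedily pick up to k frames whose visible-voxel sets best cover the scene.

    frame_voxels: list of sets; frame_voxels[i] holds the voxel ids
                  observed by frame i (e.g., from back-projected depth).
    k:            maximum number of frames to keep.
    """
    remaining = set(range(len(frame_voxels)))
    covered, selected = set(), []
    while remaining and len(selected) < k:
        # Choose the frame contributing the most not-yet-covered voxels.
        best = max(remaining, key=lambda i: len(frame_voxels[i] - covered))
        if not frame_voxels[best] - covered:
            break  # no frame adds new coverage; stop early
        selected.append(best)
        covered |= frame_voxels[best]
        remaining.remove(best)
    return sorted(selected)
```

This keeps the token budget fixed while preferring frames that jointly see as much of the scene as possible, which is how the abstract's trade-off between computational cost and performance would be realized.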