LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
December 2, 2024
Authors: Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan
cs.AI
Abstract
Research on 3D Vision-Language Models (3D-VLMs) is gaining increasing
attention, which is crucial for developing embodied AI within 3D scenes, such
as visual navigation and embodied question answering. Due to the high density
of visual features, especially in large 3D scenes, accurately locating
task-relevant visual information is challenging. Existing works attempt to
segment all objects and consider their features as scene representations.
However, these task-agnostic object features contain much redundant information
and lack details of the task-relevant areas. To tackle these problems, we
propose LSceneLLM, an adaptive framework that automatically identifies
task-relevant areas by leveraging LLM's visual preference for different tasks,
followed by a plug-and-play scene magnifier module to capture fine-grained
details in focused areas. Specifically, a dense token selector examines the
LLM's attention map to identify the visual preferences induced by the instruction
input and then magnifies fine-grained details of the focused area. An adaptive
self-attention module is leveraged to fuse the coarse-grained and selected
fine-grained visual information. To comprehensively evaluate the large scene
understanding ability of 3D-VLMs, we further introduce a cross-room
understanding benchmark, XR-Scene, which contains a series of large scene
understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption.
Experiments show that our method surpasses existing methods on both large scene
understanding and existing scene understanding benchmarks. Plugging our scene
magnifier module into existing 3D-VLMs also brings significant improvements.
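To make the described mechanism concrete, below is a minimal PyTorch sketch of the idea as stated in the abstract: the LLM's attention over coarse scene tokens is used to pick focus regions, dense (fine-grained) tokens are gathered for those regions, and coarse and dense tokens are fused with self-attention. The class name `SceneMagnifierSketch`, the top-k selection rule, and all tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; module names, shapes, and the top-k rule are assumed.
import torch
import torch.nn as nn


class SceneMagnifierSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, top_k: int = 4):
        super().__init__()
        self.top_k = top_k  # number of coarse regions to "zoom into" (assumed)
        self.fuse = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, coarse_tokens, dense_tokens, llm_attn):
        """
        coarse_tokens: (B, Nc, D)      coarse scene tokens fed to the LLM
        dense_tokens:  (B, Nc, Nd, D)  fine-grained tokens per coarse region
        llm_attn:      (B, Nc)         LLM attention mass on each coarse token
                                       (e.g. averaged over heads and instruction tokens)
        """
        B, Nc, Nd, D = dense_tokens.shape
        # Dense token selector: pick the top-k coarse regions the LLM attends to.
        _, idx = llm_attn.topk(self.top_k, dim=-1)                 # (B, k)
        gather_idx = idx[:, :, None, None].expand(B, self.top_k, Nd, D)
        selected = dense_tokens.gather(1, gather_idx)              # (B, k, Nd, D)
        selected = selected.reshape(B, self.top_k * Nd, D)
        # Adaptive fusion: coarse and selected fine-grained tokens attend to each other.
        tokens = torch.cat([coarse_tokens, selected], dim=1)
        fused, _ = self.fuse(tokens, tokens, tokens)
        return fused  # (B, Nc + k*Nd, D) fused visual tokens returned to the LLM
```

Under these assumptions, the plug-and-play aspect comes from the fact that the module only consumes attention weights the LLM already produces and emits ordinary visual tokens, so it could in principle be attached to an existing 3D-VLM without changing its backbone.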