LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
December 2, 2024
Authors: Hongyan Zhi, Peihao Chen, Junyan Li, Shuailei Ma, Xinyu Sun, Tianhang Xiang, Yinjie Lei, Mingkui Tan, Chuang Gan
cs.AI
Abstract
Research on 3D Vision-Language Models (3D-VLMs), which are crucial for
developing embodied AI within 3D scenes (e.g., visual navigation and embodied
question answering), is gaining increasing attention. Due to the high density
of visual features, especially in large 3D scenes, accurately locating
task-relevant visual information is challenging. Existing works attempt to
segment all objects and consider their features as scene representations.
However, these task-agnostic object features contain much redundant
information while lacking details of the task-relevant areas. To tackle these problems, we
propose LSceneLLM, an adaptive framework that automatically identifies
task-relevant areas by leveraging the LLM's visual preferences for different tasks,
followed by a plug-and-play scene magnifier module to capture fine-grained
details in focused areas. Specifically, a dense token selector examines the
LLM's attention map to identify its visual preferences for the instruction
input and then magnifies fine-grained details of the focused area. An adaptive
self-attention module is leveraged to fuse the coarse-grained and selected
fine-grained visual information. To comprehensively evaluate the large scene
understanding ability of 3D-VLMs, we further introduce a cross-room
understanding benchmark, XR-Scene, which contains a series of large scene
understanding tasks including XR-QA, XR-EmbodiedPlanning, and XR-SceneCaption.
Experiments show that our method surpasses existing methods on both large scene
understanding and existing scene understanding benchmarks. Plugging our scene
magnifier module into existing 3D-VLMs also brings significant improvements.
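To make the described pipeline concrete, below is a minimal PyTorch sketch of the scene magnifier idea from the abstract: rank coarse scene tokens by the LLM's instruction-conditioned attention, gather the dense tokens of the top-ranked regions, and fuse both granularities with self-attention. All names and shapes here (SceneMagnifierSketch, num_focus, the per-region dense-token layout) are hypothetical illustrations, not the authors' released implementation.

```python
# Illustrative sketch only; module and parameter names are hypothetical.
import torch
import torch.nn as nn


class SceneMagnifierSketch(nn.Module):
    """Selects the coarse visual tokens the LLM attends to most, gathers the
    corresponding fine-grained (dense) tokens, and fuses both granularities
    with a self-attention block."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_focus: int = 4):
        super().__init__()
        self.num_focus = num_focus  # how many coarse regions to "zoom into"
        self.fuse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(
        self,
        coarse_tokens: torch.Tensor,  # (B, N_coarse, dim) scene-level tokens
        dense_tokens: torch.Tensor,   # (B, N_coarse, K, dim) K dense tokens per region
        llm_attention: torch.Tensor,  # (B, N_coarse) instruction-to-vision attention
    ) -> torch.Tensor:
        # 1) Dense token selector: rank coarse regions by the LLM's attention
        #    to find the areas most relevant to the current instruction.
        _, focus_idx = llm_attention.topk(self.num_focus, dim=-1)  # (B, num_focus)

        # 2) Magnify: gather the fine-grained tokens of the focused regions.
        B, _, K, D = dense_tokens.shape
        idx = focus_idx[..., None, None].expand(B, self.num_focus, K, D)
        fine = torch.gather(dense_tokens, 1, idx).flatten(1, 2)  # (B, num_focus*K, D)

        # 3) Adaptive fusion: self-attention over the concatenation of the
        #    coarse-grained and selected fine-grained visual tokens.
        tokens = torch.cat([coarse_tokens, fine], dim=1)
        fused, _ = self.fuse_attn(tokens, tokens, tokens)
        return fused
```

For example, with 32 coarse regions of 16 dense tokens each and num_focus=4, the fused sequence contains 32 + 4*16 = 96 tokens, so only the regions the LLM actually attends to pay the cost of fine-grained detail.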