3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

December 24, 2024
Authors: Tatiana Zemskova, Dmitry Yudin
cs.AI

Abstract

A 3D scene graph represents a compact scene model that stores information about the objects in a scene and the semantic relationships between them, making it promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various natural-language queries about the scene. Large Language Models (LLMs) are a beneficial solution for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting them to the 3D world. However, existing methods do not explicitly use information about the semantic relationships between objects, limiting themselves to the objects' coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. This learnable representation is used as input to LLMs for performing 3D vision-language tasks. In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
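The abstract describes turning a 3D scene graph (objects plus their pairwise semantic relations) into a learnable token sequence that is fed to an LLM alongside the text prompt. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' implementation (see the linked repository for the actual code): the class name `SceneGraphTokenizer`, the feature dimensions, and the object/relation/object token ordering are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's code) of projecting 3D scene-graph
# features into an LLM's embedding space as "soft" prefix tokens.
import torch
import torch.nn as nn


class SceneGraphTokenizer(nn.Module):
    """Projects per-object features and pairwise relation features into the
    LLM embedding space, producing one token sequence per scene."""

    def __init__(self, obj_feat_dim=256, rel_feat_dim=128, llm_dim=4096):
        super().__init__()
        # Linear projections into the LLM token-embedding space (dims are placeholders).
        self.obj_proj = nn.Linear(obj_feat_dim, llm_dim)
        self.rel_proj = nn.Linear(rel_feat_dim, llm_dim)

    def forward(self, obj_feats, rel_feats, edges):
        """
        obj_feats: (N, obj_feat_dim)  per-object features (e.g. from a point-cloud encoder)
        rel_feats: (E, rel_feat_dim)  features of semantic relations between object pairs
        edges:     (E, 2)             (subject, object) indices of each relation
        Returns (T, llm_dim): for every object, its token followed by
        (relation, neighbor) tokens for its outgoing edges.
        """
        obj_tokens = self.obj_proj(obj_feats)   # (N, llm_dim)
        rel_tokens = self.rel_proj(rel_feats)   # (E, llm_dim)

        sequence = []
        for i in range(obj_feats.shape[0]):
            sequence.append(obj_tokens[i])
            for e, (src, dst) in enumerate(edges.tolist()):
                if src == i:
                    sequence.append(rel_tokens[e])
                    sequence.append(obj_tokens[dst])
        # The resulting embeddings would be prepended to the LLM's text embeddings.
        return torch.stack(sequence, dim=0)


# Toy usage: 3 objects, 2 relations.
tokenizer = SceneGraphTokenizer()
obj_feats = torch.randn(3, 256)
rel_feats = torch.randn(2, 128)
edges = torch.tensor([[0, 1], [2, 0]])
tokens = tokenizer(obj_feats, rel_feats, edges)
print(tokens.shape)  # torch.Size([7, 4096])
```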
