3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

December 24, 2024
Authors: Tatiana Zemskova, Dmitry Yudin
cs.AI

Abstract

A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, which makes it promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are a beneficial solution for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting to the 3D world. However, existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. This learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
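To make the idea concrete, the sketch below shows one plausible way a 3D scene graph (object nodes plus pairwise semantic-relation features) could be flattened into a sequence of learnable embeddings in the LLM's token space. This is a minimal illustration, not the authors' implementation: the class name `SceneGraphTokenizer`, the projections `obj_proj`/`rel_proj`, and the nearest-neighbor selection are all assumptions made for the example.

```python
# Minimal sketch (hypothetical, not the 3DGraphLLM codebase): serialize a 3D
# scene graph into "graph tokens" that can be prepended to an LLM prompt.
import torch
import torch.nn as nn


class SceneGraphTokenizer(nn.Module):
    def __init__(self, obj_dim: int, rel_dim: int, llm_dim: int, k_neighbors: int = 2):
        super().__init__()
        # Learnable projections from object / relation feature spaces
        # into the LLM token-embedding space.
        self.obj_proj = nn.Linear(obj_dim, llm_dim)
        self.rel_proj = nn.Linear(rel_dim, llm_dim)
        self.k_neighbors = k_neighbors

    def forward(self, obj_feats: torch.Tensor, rel_feats: torch.Tensor,
                centers: torch.Tensor) -> torch.Tensor:
        """
        obj_feats: (N, obj_dim)    per-object features from a 3D encoder
        rel_feats: (N, N, rel_dim) pairwise semantic-relation features
        centers:   (N, 3)          object centroids, used to pick neighbors
        returns:   (N * (1 + 2*k_neighbors), llm_dim) sequence of graph tokens
        """
        n = obj_feats.size(0)
        dists = torch.cdist(centers, centers)                        # (N, N)
        dists.fill_diagonal_(float("inf"))                           # exclude self
        knn = dists.topk(self.k_neighbors, largest=False).indices    # (N, k)

        tokens = []
        for i in range(n):
            tokens.append(self.obj_proj(obj_feats[i]))               # object token
            for j in knn[i]:
                # interleave relation and neighbor tokens for each nearest neighbor
                tokens.append(self.rel_proj(rel_feats[i, j]))
                tokens.append(self.obj_proj(obj_feats[j]))
        return torch.stack(tokens, dim=0)


# Toy usage: 5 objects with random features, projected to a 4096-dim LLM space.
tok = SceneGraphTokenizer(obj_dim=256, rel_dim=64, llm_dim=4096, k_neighbors=2)
seq = tok(torch.randn(5, 256), torch.randn(5, 5, 64), torch.randn(5, 3))
print(seq.shape)  # torch.Size([25, 4096])
```

The resulting embedding sequence can then be concatenated with the embedded natural-language query before being fed to the LLM; the exact graph serialization and projection architecture used by 3DGraphLLM is described in the paper and repository.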
