3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Abstract

A 3D scene graph is a compact scene model that stores information about objects and the semantic relationships between them, which makes it promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be able to respond to a variety of natural-language queries about the scene. Large Language Models (LLMs) are well suited to user-robot interaction because of their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have shown the potential to improve the quality of LLM responses by adapting to the 3D world. However, existing methods do not explicitly use information about the semantic relationships between objects and limit themselves to information about object coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input to LLMs to perform 3D vision-language tasks. In experiments on the popular ScanRefer, RIORefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects. The code is publicly available at https://github.com/CognitiveAISystems/3DGraphLLM.
