Exploring the Potential of Encoder-free Architectures in 3D LMMs
February 13, 2025
Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
cs.AI
Abstract
Encoder-free architectures have been preliminarily explored in the 2D visual
domain, yet it remains an open question whether they can be effectively applied
to 3D understanding scenarios. In this paper, we present the first
comprehensive investigation into the potential of encoder-free architectures to
overcome the challenges of encoder-based 3D Large Multimodal Models (LMMs).
These challenges include the failure to adapt to varying point cloud
resolutions and the point features from the encoder not meeting the semantic
needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to
remove the encoder and enable the LLM to assume the role of the 3D encoder: 1)
We propose the LLM-embedded Semantic Encoding strategy in the pre-training
stage, exploring the effects of various point cloud self-supervised losses, and
we present the Hybrid Semantic Loss to extract high-level semantics. 2) We
introduce the Hierarchical Geometry Aggregation strategy in the instruction
tuning stage. This incorporates inductive bias into the early layers of the LLM
so that they focus on the local details of the point clouds. Finally, we present the
first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current
state-of-the-art model, ShapeLLM-13B, achieving 55.0%, 50.92%, and 42.7% on the
classification, captioning, and VQA tasks, respectively. Our results
demonstrate that the encoder-free architecture is highly promising for
replacing encoder-based architectures in the field of 3D understanding. The
code is released at https://github.com/Ivan-Tang-3D/ENEL.
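
The abstract does not spell out how local geometry is aggregated, so the following is only a generic illustration of the idea, not ENEL's implementation. It assumes a common point cloud recipe (farthest point sampling, k-nearest-neighbour grouping, max pooling) and shows how such a step yields a fixed number of locally pooled tokens regardless of input resolution. The function names `farthest_point_sampling` and `aggregate_local_geometry`, and the parameters `num_centers` and `k`, are hypothetical.

```python
import numpy as np


def farthest_point_sampling(points, num_centers):
    """Greedily pick num_centers indices that are spread out over the cloud."""
    n = points.shape[0]
    chosen = np.zeros(num_centers, dtype=np.int64)
    # Squared distance from every point to its nearest already-chosen center.
    min_dist = np.full(n, np.inf)
    current = 0  # assumption: start from the first point for determinism
    for i in range(num_centers):
        chosen[i] = current
        dist = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        # The next center is the point farthest from all chosen centers.
        current = int(np.argmax(min_dist))
    return chosen


def aggregate_local_geometry(points, features, num_centers, k):
    """Pool each center's k nearest neighbours into one token feature.

    points:   (N, 3) xyz coordinates
    features: (N, C) per-point features
    returns:  centers (num_centers, 3), pooled (num_centers, C)
    """
    center_idx = farthest_point_sampling(points, num_centers)
    centers = points[center_idx]
    # Pairwise squared distances between centers and all points: (M, N).
    d2 = np.sum((centers[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    knn_idx = np.argsort(d2, axis=1)[:, :k]   # (M, k) neighbour indices
    grouped = features[knn_idx]               # (M, k, C)
    pooled = grouped.max(axis=1)              # max-pool over each neighbourhood
    return centers, pooled
```

Calling `aggregate_local_geometry(points, features, num_centers=64, k=8)` on clouds of 1,024 or 4,096 points returns 64 pooled tokens in both cases, which is the resolution-agnostic behaviour the abstract motivates. The actual layers, pooling operators, and placement inside the LLM used by ENEL are described in the paper and the released code, not here.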