SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

Abstract

Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
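
As a rough illustration of why predicting the next scale can shorten generation compared with predicting the next single token, the sketch below counts sequential decoding steps under each scheme. The scale sizes and function names are hypothetical and are not taken from SAR3D; the sketch also assumes that all tokens within one scale are predicted in a single forward pass.

```python
def next_token_steps(scale_sides):
    """Sequential steps if every token is predicted one at a time."""
    return sum(side * side for side in scale_sides)


def next_scale_steps(scale_sides):
    """Sequential steps if all tokens of a scale are predicted together."""
    return len(scale_sides)


# Hypothetical square token maps with side lengths 1, 2, 4, 8 (not from the paper).
scale_sides = [1, 2, 4, 8]

token_steps = next_token_steps(scale_sides)  # 1 + 4 + 16 + 64
scale_steps = next_scale_steps(scale_sides)  # one step per scale

print(token_steps, scale_steps)  # prints: 85 4
```

Under these assumed sizes, token-by-token decoding needs 85 sequential steps while scale-by-scale decoding needs 4. The actual speedup depends on the real token-map sizes and the per-step cost, which this sketch does not model.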
