SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

November 25, 2024
Authors: Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan
cs.AI

Abstract

Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
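
To make the next-scale idea concrete, the sketch below counts sequential prediction steps under the two decoding strategies the abstract contrasts. The scale schedule and the function names are hypothetical, chosen only for illustration; they are not taken from the paper, and step count is only a rough proxy for wall-clock time.

```python
"""Count sequential autoregressive steps: next-token vs. next-scale decoding.

Illustrative only. The scale schedule below is an assumption, not the
configuration used by SAR3D.
"""

# Hypothetical number of tokens at each scale, ordered coarse to fine.
HYPOTHETICAL_SCALES = [1, 4, 9, 16, 25, 36, 64, 100, 196, 256]


def next_token_steps(scale_sizes):
    """Steps needed when every token is predicted one at a time."""
    return sum(scale_sizes)


def next_scale_steps(scale_sizes):
    """Steps needed when all tokens of a scale are predicted in one pass."""
    return len(scale_sizes)


if __name__ == "__main__":
    token_steps = next_token_steps(HYPOTHETICAL_SCALES)
    scale_steps = next_scale_steps(HYPOTHETICAL_SCALES)
    print(f"next-token decoding: {token_steps} sequential steps")
    print(f"next-scale decoding: {scale_steps} sequential steps")
```

Under this assumed schedule, the number of sequential passes depends on the number of scales, not the total number of tokens. The sketch ignores the cost of each pass, so it does not by itself account for the reported 0.82-second figure.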
