SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE

November 25, 2024
Authors: Yongwei Chen, Yushi Lan, Shangchen Zhou, Tengfei Wang, Xingang Pan
cs.AI

Abstract

Autoregressive models have demonstrated remarkable success across various fields, from large language models (LLMs) to large multimodal models (LMMs) and 2D content generation, moving closer to artificial general intelligence (AGI). Despite these advances, applying autoregressive approaches to 3D object generation and understanding remains largely unexplored. This paper introduces Scale AutoRegressive 3D (SAR3D), a novel framework that leverages a multi-scale 3D vector-quantized variational autoencoder (VQVAE) to tokenize 3D objects for efficient autoregressive generation and detailed understanding. By predicting the next scale in a multi-scale latent representation instead of the next single token, SAR3D reduces generation time significantly, achieving fast 3D object generation in just 0.82 seconds on an A6000 GPU. Additionally, given the tokens enriched with hierarchical 3D-aware information, we finetune a pretrained LLM on them, enabling multimodal comprehension of 3D content. Our experiments show that SAR3D surpasses current 3D generation methods in both speed and quality and allows LLMs to interpret and caption 3D models comprehensively.
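
The speed argument in the abstract rests on the number of sequential decoding steps. The sketch below is only a back-of-the-envelope illustration of that idea, not the authors' implementation: the coarse-to-fine schedule `scales` and the assumption that each scale is a square `side x side` token map are hypothetical and are not taken from the paper.

```python
# Illustrative only: the scale schedule and the square token-map layout
# are hypothetical assumptions, not values reported for SAR3D.

def tokens_per_scale(side_lengths):
    """Token count of each scale, assuming a square side x side token map."""
    return [side * side for side in side_lengths]


def next_token_steps(side_lengths):
    """Sequential steps when every token is predicted one at a time."""
    return sum(tokens_per_scale(side_lengths))


def next_scale_steps(side_lengths):
    """Sequential steps when all tokens of one scale are predicted together."""
    return len(side_lengths)


if __name__ == "__main__":
    scales = [1, 2, 4, 8, 16]  # hypothetical coarse-to-fine schedule
    print("tokens per scale:", tokens_per_scale(scales))
    print("next-token steps:", next_token_steps(scales))
    print("next-scale steps:", next_scale_steps(scales))
```

Under these assumed sizes, next-token decoding needs one sequential step per token, while next-scale decoding needs one step per scale. Actual wall-clock time also depends on model size and hardware, so this count is only a rough proxy for the speedup the abstract reports.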
