LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models

November 14, 2024
Authors: Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng
cs.AI

Abstract

This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) enabling conversational 3D generation and mesh understanding. A primary challenge is effectively tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly. To address this, we introduce LLaMA-Mesh, a novel approach that represents the vertex coordinates and face definitions of 3D meshes as plain text, allowing direct integration with LLMs without expanding the vocabulary. We construct a supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate 3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs as required, and (3) understand and interpret 3D meshes. Our work is the first to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge for 3D mesh generation in a text-based format, effectively unifying the 3D and text modalities. LLaMA-Mesh achieves mesh generation quality on par with models trained from scratch while maintaining strong text generation performance.
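To make the text-based mesh representation concrete: the abstract does not spell out the exact serialization, so the sketch below is a minimal, hypothetical example. It assumes OBJ-style `v`/`f` lines with vertex coordinates quantized to a small integer range (the function name `mesh_to_text`, the 64-level quantization, and the toy triangle are illustrative assumptions, not details taken from the paper). It shows why no vocabulary expansion is needed: every line is ordinary text that a pretrained tokenizer already splits into known tokens.

```python
# Hypothetical sketch of plain-text mesh serialization for LLM input.
# Assumes OBJ-style syntax ("v x y z" / "f i j k") with coordinates
# quantized to integer bins; format details are assumptions, not
# specifics from the LLaMA-Mesh abstract.

def mesh_to_text(vertices, faces, bins=64):
    """Quantize vertex coordinates into `bins` integer levels and emit
    OBJ-style lines as a single string an LLM can consume directly."""
    coords = [c for v in vertices for c in v]
    lo, hi = min(coords), max(coords)
    scale = (bins - 1) / (hi - lo) if hi > lo else 0.0

    lines = []
    for x, y, z in vertices:
        qx, qy, qz = (int(round((c - lo) * scale)) for c in (x, y, z))
        lines.append(f"v {qx} {qy} {qz}")
    for i, j, k in faces:  # OBJ face indices are 1-based
        lines.append(f"f {i} {j} {k}")
    return "\n".join(lines)


# Toy example: a single triangle becomes four short lines of plain text.
tri_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tri_faces = [(1, 2, 3)]
print(mesh_to_text(tri_vertices, tri_faces))
# v 0 0 0
# v 63 0 0
# v 0 63 0
# f 1 2 3
```

Because the output is ordinary whitespace-separated text, the reverse direction (parsing model output back into vertices and faces) is equally simple, which is what lets a single fine-tuned model interleave mesh data with conversational text.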

