LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
November 14, 2024
Authors: Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng
cs.AI
Abstract
This work explores expanding the capabilities of large language models (LLMs)
pretrained on text to generate 3D meshes within a unified model. This offers
key advantages of (1) leveraging spatial knowledge already embedded in LLMs,
derived from textual sources like 3D tutorials, and (2) enabling conversational
3D generation and mesh understanding. A primary challenge is effectively
tokenizing 3D mesh data into discrete tokens that LLMs can process seamlessly.
To address this, we introduce LLaMA-Mesh, a novel approach that represents the
vertex coordinates and face definitions of 3D meshes as plain text, allowing
direct integration with LLMs without expanding the vocabulary. We construct a
supervised fine-tuning (SFT) dataset enabling pretrained LLMs to (1) generate
3D meshes from text prompts, (2) produce interleaved text and 3D mesh outputs
as required, and (3) understand and interpret 3D meshes. Our work is the first
to demonstrate that LLMs can be fine-tuned to acquire complex spatial knowledge
for 3D mesh generation in a text-based format, effectively unifying the 3D and
text modalities. LLaMA-Mesh achieves mesh generation quality on par with models
trained from scratch while maintaining strong text generation performance.
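For intuition, the sketch below shows what the plain-text mesh representation described in the abstract might look like in practice. It is an illustrative example, not the authors' released code: the OBJ-style `v`/`f` lines follow the abstract's description of writing vertex coordinates and face definitions as text, while the `mesh_to_text` helper and the 64-bin coordinate quantization are assumptions made here for brevity.

```python
# A minimal sketch (not the authors' code) of the core idea: a triangle mesh
# is written out as OBJ-style plain text, so an off-the-shelf LLM tokenizer
# can consume it without any new vocabulary. Quantizing coordinates to a
# small integer range is an illustrative assumption to keep token sequences
# short; the exact scheme used by the paper may differ.

def mesh_to_text(vertices, faces, bins=64):
    """Serialize a mesh as OBJ-like text with quantized integer coordinates.

    vertices: list of (x, y, z) floats, assumed normalized to [-1, 1]
    faces:    list of (i, j, k) zero-based vertex indices
    """
    lines = []
    for x, y, z in vertices:
        # Map each coordinate from [-1, 1] to an integer in [0, bins - 1].
        qx, qy, qz = (int(round((c + 1) / 2 * (bins - 1))) for c in (x, y, z))
        lines.append(f"v {qx} {qy} {qz}")
    for i, j, k in faces:
        # OBJ face indices are 1-based.
        lines.append(f"f {i + 1} {j + 1} {k + 1}")
    return "\n".join(lines)

# Example: a single triangle becomes three "v" lines and one "f" line,
# which can be embedded directly in a chat prompt or a model response.
triangle = mesh_to_text(
    [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)],
    [(0, 1, 2)],
)
print(triangle)
```

Because the result is ordinary text, the same string can presumably appear inside a supervised fine-tuning example, e.g. as the assistant's reply to a prompt like "Create a 3D model of a triangle", which is what makes the interleaved text-and-mesh outputs described in the abstract possible.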