MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Abstract

Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision because existing datasets are limited in size, diversity, and annotation depth. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed descriptions (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from the source datasets into our annotation pipeline to add domain-specific information to our annotations and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline: we fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15 seconds. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% as judged by GPT-4 and 73.40% as judged by human evaluators.
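
As a rough illustration of the multi-level description structure mentioned in the abstract, the sketch below checks whether an annotation's length falls within the word range stated for its level. It is not the authors' pipeline: the names `Annotation`, `LEVEL_WORD_RANGES`, `asset_id`, and the level keys are hypothetical, and only the two ranges given in the abstract are encoded, since intermediate levels are not specified there.

```python
from dataclasses import dataclass

# Word-count ranges taken from the abstract for the two ends of the hierarchy.
# Intermediate levels are not described there, so none are listed here.
# The dictionary keys are hypothetical labels chosen for this sketch.
LEVEL_WORD_RANGES = {
    "detailed": (150, 200),
    "semantic_tags": (10, 20),
}


@dataclass(frozen=True)
class Annotation:
    """One text annotation attached to a 3D asset (hypothetical schema)."""

    asset_id: str  # hypothetical identifier for the 3D asset
    level: str     # one of the keys in LEVEL_WORD_RANGES
    text: str

    def word_count(self) -> int:
        # Whitespace tokenization; a real pipeline may count words differently.
        return len(self.text.split())

    def within_range(self) -> bool:
        # True if the word count lies inside the range stated for this level.
        low, high = LEVEL_WORD_RANGES[self.level]
        return low <= self.word_count() <= high
```

For example, a 12-word tag string labelled `"semantic_tags"` passes the check, whereas the same text labelled `"detailed"` does not, because it falls short of the 150-word lower bound.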
