Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data
March 27, 2025
Authors: Zhiyuan Ma, Xinyue Liang, Rongyuan Wu, Xiangyu Zhu, Zhen Lei, Lei Zhang
cs.AI
Abstract
It is highly desirable to obtain a model that can generate high-quality 3D
meshes from text prompts in just seconds. While recent attempts have adapted
pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into
generators of 3D representations (e.g., Triplane), they often suffer from poor
quality due to the lack of sufficient high-quality 3D training data. Aiming at
overcoming the data shortage, we propose a novel training scheme, termed
Progressive Rendering Distillation (PRD), which eliminates the need for 3D
ground truth by distilling multi-view diffusion models and adapting SD into a
native 3D generator. In each training iteration, PRD uses the U-Net to
progressively denoise the latent from random noise for a few steps, and in each
step it decodes the denoised latent into a 3D output. Multi-view diffusion
models, including MVDream and RichDreamer, are applied jointly with SD to distill
text-consistent textures and geometries into the 3D outputs through score
distillation. Since PRD supports training without 3D ground truth, we can
easily scale up the training data and improve generation quality for
challenging text prompts with creative concepts. Meanwhile, PRD accelerates
inference, so the trained generator produces outputs in just a few denoising steps. With PRD, we
train a Triplane generator, namely TriplaneTurbo, which adds only 2.5%
trainable parameters to adapt SD for Triplane generation. TriplaneTurbo
outperforms previous text-to-3D generators in both efficiency and quality.
Specifically, it produces high-quality 3D meshes in 1.2 seconds and
generalizes well to challenging text inputs. The code is available at
https://github.com/theEricMa/TriplaneTurbo.
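The training loop described above (start from random noise, denoise for a few U-Net steps, and decode the latent into a 3D output at every step so the multi-view teachers can supervise all intermediate outputs) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `denoise_step`, `decode_to_triplane`, and `score_distillation_loss` are hypothetical stand-ins for the SD U-Net step, the triplane decoder, and the MVDream/RichDreamer score-distillation objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent, step):
    # Toy stand-in for one U-Net denoising step: shrink the noise.
    return latent * 0.5

def decode_to_triplane(latent):
    # Toy stand-in for decoding a (partially) denoised latent into a
    # triplane-style 3D representation.
    return np.tanh(latent)

def score_distillation_loss(triplane):
    # Toy surrogate for the multi-view score-distillation objective
    # that a frozen teacher (e.g., MVDream/RichDreamer) would provide
    # on rendered views of the 3D output.
    return float(np.mean(triplane ** 2))

def prd_training_iteration(latent_shape=(4, 8, 8), num_steps=4):
    """One PRD iteration: begin from pure noise, denoise for a few
    steps, and decode the latent into a 3D output at *each* step,
    collecting a distillation loss for every intermediate output."""
    latent = rng.standard_normal(latent_shape)
    losses = []
    for step in range(num_steps):
        latent = denoise_step(latent, step)
        triplane = decode_to_triplane(latent)  # 3D output at this step
        losses.append(score_distillation_loss(triplane))
    return losses

losses = prd_training_iteration()
print(losses)
```

Because every denoising step yields a supervised 3D output, the generator learns to produce usable results after only a few steps, which is what enables the fast inference reported in the abstract.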