Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Abstract

We explore the use of Residual Vector Quantization (RVQ) for high-fidelity generation in vector-quantized generative models. RVQ preserves more of the data's fidelity by representing each item with a deeper stack of tokens. However, increasing the number of tokens in a generative model slows inference. To address this, we introduce ResGen, an efficient RVQ-based discrete diffusion model that generates high-fidelity samples without compromising sampling speed. Our key idea is to directly predict the vector embedding of collective tokens rather than individual tokens. Moreover, we show that the proposed token masking and multi-token prediction method can be formulated within a principled probabilistic framework using a discrete diffusion process and variational inference. We validate the efficacy and generalizability of the proposed method on two challenging tasks across different modalities: conditional image generation on ImageNet 256x256 and zero-shot text-to-speech synthesis. Experimental results show that ResGen outperforms its autoregressive counterparts on both tasks, delivering superior performance without compromising sampling speed. Furthermore, as we scale the depth of RVQ, our generative models exhibit enhanced generation fidelity or faster sampling speeds compared to similarly sized baseline models. Project page: https://resgen-genai.github.io
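As a rough illustration of why deeper RVQ tokens preserve more fidelity, the following is a minimal sketch of generic residual vector quantization, not the ResGen model itself. The codebooks are random rather than learned, and the helper names (`make_codebooks`, `rvq_encode`, `rvq_decode`), the zero entry in each codebook, and the halving scale per depth are assumptions made only for this example. The vector returned by `rvq_decode` is the sum of the selected code vectors across depths, which is one simple way to picture a single embedding standing in for a whole stack of tokens.

```python
import random


def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)),
    )


def rvq_encode(x, codebooks):
    """Return one token per depth; each token quantizes the remaining residual."""
    residual = list(x)
    tokens = []
    for cb in codebooks:
        k = nearest(cb, residual)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return tokens


def rvq_decode(tokens, codebooks, depth=None):
    """Sum the selected code vectors over the first `depth` levels."""
    depth = len(tokens) if depth is None else depth
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(tokens[:depth], codebooks[:depth]):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


def make_codebooks(depth, size, dim, seed=0):
    """Hypothetical random codebooks (a real model would learn these)."""
    rng = random.Random(seed)
    codebooks = []
    for d in range(depth):
        scale = 0.5 ** d  # assumption: finer codes at deeper levels
        # Assumption: a zero entry lets a level add nothing, so the
        # reconstruction error can never grow with depth.
        cb = [[0.0] * dim]
        cb += [[rng.uniform(-scale, scale) for _ in range(dim)]
               for _ in range(size - 1)]
        codebooks.append(cb)
    return codebooks


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


codebooks = make_codebooks(depth=4, size=16, dim=4)
x = [0.3, -0.7, 0.5, 0.1]
tokens = rvq_encode(x, codebooks)
errors = [sq_error(x, rvq_decode(tokens, codebooks, d)) for d in range(1, 5)]
print(tokens)
print(errors)  # non-increasing as more depths are used
```

Each added depth contributes one more token per position, which is the source of the trade-off between fidelity and the number of tokens a generative model must produce.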
