Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
December 13, 2024
Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho
cs.AI
Abstract
We explore the use of Residual Vector Quantization (RVQ) for high-fidelity
generation in vector-quantized generative models. This quantization technique
maintains higher data fidelity by representing each position with a deeper
stack of tokens. However, increasing the number of tokens in generative models
leads to slower inference
speeds. To this end, we introduce ResGen, an efficient RVQ-based discrete
diffusion model that generates high-fidelity samples without compromising
sampling speed. Our key idea is to directly predict the vector embedding of
collective tokens rather than individual ones. Moreover, we demonstrate that
our proposed token masking and multi-token prediction method can be formulated
within a principled probabilistic framework using a discrete diffusion process
and variational inference. We validate the efficacy and generalizability of the
proposed method on two challenging tasks across different modalities:
conditional image generation on ImageNet 256x256 and zero-shot text-to-speech
synthesis. Experimental results demonstrate that ResGen outperforms
autoregressive counterparts in both tasks, delivering superior performance
without compromising sampling speed. Furthermore, as we scale the depth of RVQ,
our generative models exhibit enhanced generation fidelity or faster sampling
speeds compared to similarly sized baseline models. The project page can be
found at https://resgen-genai.github.io.
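
To make the terms in the abstract concrete, the following is a minimal sketch of generic residual vector quantization, not the authors' implementation. It assumes random stand-in codebooks (real RVQ codebooks are learned), and all names (`make_codebooks`, `rvq_encode`, `rvq_decode`) are hypothetical. Two further simplifications are assumptions of this sketch only: each codebook contains a zero codeword so that an extra level can never increase the error, and deeper codebooks are scaled down. The sketch shows that a stack of depth tokens decodes to a single summed embedding, which is the kind of "collective" vector the abstract refers to, and that reconstruction error does not grow as depth increases.

```python
import random


def make_codebooks(depth, size, dim, seed=0):
    """Random stand-in codebooks (assumption: real RVQ codebooks are learned)."""
    rng = random.Random(seed)
    books = []
    for d in range(depth):
        scale = 0.5 ** d  # deeper levels model smaller residuals
        book = [[0.0] * dim]  # zero codeword: a level can never increase the error
        book += [[rng.gauss(0.0, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(x, books):
    """Greedy residual quantization: emit one token index per depth level."""
    residual = list(x)
    tokens = []
    for book in books:
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        tokens.append(idx)
        # Pass what is still unexplained on to the next level.
        residual = [r - c for r, c in zip(residual, book[idx])]
    return tokens


def rvq_decode(tokens, books):
    """Sum the selected codewords: the embedding of the whole token stack."""
    out = [0.0] * len(books[0][0])
    for idx, book in zip(tokens, books):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


DEPTH, SIZE, DIM = 4, 16, 8
books = make_codebooks(DEPTH, SIZE, DIM)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

tokens = rvq_encode(x, books)
# Error when only the first d tokens of the stack are used, for d = 1..DEPTH.
errors = [sq_dist(x, rvq_decode(tokens[:d], books)) for d in range(1, DEPTH + 1)]
print("tokens:", tokens)
print("squared error by depth:", [round(e, 4) for e in errors])
```

In this sketch a sequence model that emitted every token separately would need `DEPTH` predictions per position, whereas the summed vector returned by `rvq_decode` is one target per position. How ResGen actually predicts and re-quantizes that vector is described in the paper and is not reproduced here.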