MaskBit: Embedding-free Image Generation via Bit Tokens
September 24, 2024
Authors: Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen
cs.AI
Abstract
Masked transformer models for class-conditional image generation have become
a compelling alternative to diffusion models. Typically comprising two stages -
an initial VQGAN model for transitioning between latent space and image space,
and a subsequent Transformer model for image generation within latent space -
these frameworks offer promising avenues for image synthesis. In this study, we
present two primary contributions: Firstly, an empirical and systematic
examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel
embedding-free generation network operating directly on bit tokens - a binary
quantized representation of tokens with rich semantics. The first contribution
furnishes a transparent, reproducible, and high-performing VQGAN model,
enhancing accessibility and matching the performance of current
state-of-the-art methods while revealing previously undisclosed details. The
second contribution demonstrates that embedding-free image generation using bit
tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256x256
benchmark, with a compact generator model of only 305M parameters.
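
To make the term "bit token" concrete, here is a minimal sketch of a binary quantized token. It assumes that each latent channel is binarized by its sign, a convention the abstract does not spell out. The function names and the 4-channel example vector are hypothetical, and this is an illustration rather than the authors' implementation.

```python
from typing import List


def to_bit_token(latent: List[float]) -> List[int]:
    """Binarize each latent channel by sign (assumed convention): 1 if > 0, else 0."""
    return [1 if v > 0 else 0 for v in latent]


def bits_to_index(bits: List[int]) -> int:
    """Read the bits as a big-endian integer, i.e. an implicit codebook index."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def bits_to_signed(bits: List[int]) -> List[float]:
    """Map bits {0, 1} to {-1.0, +1.0} so a network could consume them
    directly instead of looking up a learned embedding vector."""
    return [2.0 * b - 1.0 for b in bits]


latent = [0.7, -1.2, 0.05, -0.3]  # hypothetical 4-channel encoder output
bits = to_bit_token(latent)
print(bits, bits_to_index(bits), bits_to_signed(bits))
```

With K channels, a scheme like this implies 2^K distinct tokens without storing a codebook. "Embedding-free" would then mean that the generator reads the bit pattern itself rather than an index into an embedding table.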