Factorized Visual Tokenization and Generation

Abstract

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure that each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy and promotes diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models such as CLIP and DINO to infuse semantic richness into the learned representations. This design ensures that our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted to autoregressive image generation. Project page: https://showlab.github.io/FQGAN
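
The lookup-complexity claim can be illustrated with a minimal, dependency-free sketch. It is not the FQGAN implementation. It assumes that each feature vector is split into equal chunks, one per sub-codebook, and that each chunk is matched to its nearest code by squared L2 distance. The abstract does not say how features are routed to the sub-codebooks, so the splitting scheme and the names `nearest_code`, `factorized_quantize`, and `lookup_cost` are hypothetical.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(feature, sub_codebooks):
    """Quantize one feature vector with k independent sub-codebooks.

    Assumption (not from the source): the feature is split into k equal
    chunks and chunk j is quantized only against sub-codebook j.
    """
    k = len(sub_codebooks)
    chunk = len(feature) // k
    indices, quantized = [], []
    for j, codebook in enumerate(sub_codebooks):
        part = feature[j * chunk:(j + 1) * chunk]
        idx = nearest_code(part, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


def lookup_cost(sizes):
    """Return (comparisons per feature, distinct index combinations)."""
    combinations = 1
    for size in sizes:
        combinations *= size
    return sum(sizes), combinations


# Toy setup: 2 sub-codebooks of 16 codes each over an 8-dimensional feature.
rng = random.Random(0)
dim, k, size = 8, 2, 16
sub_codebooks = [
    [[rng.gauss(0, 1) for _ in range(dim // k)] for _ in range(size)]
    for _ in range(k)
]
feature = [rng.gauss(0, 1) for _ in range(dim)]

indices, quantized = factorized_quantize(feature, sub_codebooks)
comparisons, combinations = lookup_cost([size] * k)
print(indices, comparisons, combinations)
```

In this toy setting, each feature is compared against 16 + 16 = 32 codes, while the pair of indices can take 16 × 16 = 256 distinct values. A single codebook with the same number of distinct tokens would need 256 comparisons per feature.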
