Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

Abstract

Latent-based image generative models, such as Latent Diffusion Models (LDMs) and Mask Image Models (MIMs), have achieved notable success in image generation. These models typically use reconstructive autoencoders such as VQGAN or VAE to encode pixels into a more compact latent space, and they learn the data distribution in that latent space rather than directly from pixels. This practice raises a pertinent question: is it truly the optimal choice? We begin with an intriguing observation: even when they share the same latent space, autoregressive models lag significantly behind LDMs and MIMs in image generation. This contrasts sharply with NLP, where the autoregressive GPT models have become dominant. To address this discrepancy, we introduce a unified perspective on the relationship between latent space and generative models, emphasizing the stability of the latent space in image generative modeling. We further propose a simple but effective discrete image tokenizer that stabilizes the latent space for image generative modeling. Experimental results show that image autoregressive modeling with our tokenizer (DiGIT) benefits both image understanding and image generation under the next-token prediction principle, which is inherently straightforward for GPT models but challenging for other generative models. Remarkably, for the first time, a GPT-style autoregressive model for images outperforms LDMs, and it shows substantial, GPT-like improvement as model size is scaled up. Our findings underscore the potential of an optimized latent space and of discrete tokenization for advancing image generative models. The code is available at https://github.com/DAMO-NLP-SG/DiGIT.
