SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution, high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. Our model, SnapGen, is the first to demonstrate the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters while being significantly smaller (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
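To make the idea of multi-level, cross-architecture knowledge distillation more concrete, the following is a minimal sketch of how such a loss could be assembled. It is not the SnapGen training objective: the abstract does not specify the loss terms, their weights, or which layers are matched. The function names (`mse`, `project`, `multi_level_distill_loss`), the use of mean squared error, the linear projection between feature widths, and the weights `w_task`, `w_out`, and `w_feat` are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def project(features: Sequence[float],
            weight: Sequence[Sequence[float]]) -> Vector:
    """Linearly map student features to the teacher's feature width.

    `weight` has one row per teacher dimension. A projection like this is
    one common way to compare features from different architectures
    (an assumption here, not a detail taken from the paper).
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def multi_level_distill_loss(
    student_out: Sequence[float],
    teacher_out: Sequence[float],
    target: Sequence[float],
    student_feats: Sequence[Sequence[float]],
    teacher_feats: Sequence[Sequence[float]],
    projections: Sequence[Sequence[Sequence[float]]],
    w_task: float = 1.0,   # hypothetical weight for the ordinary training loss
    w_out: float = 1.0,    # hypothetical weight for output-level distillation
    w_feat: float = 1.0,   # hypothetical weight for feature-level distillation
) -> float:
    """Combine a task loss with output-level and feature-level distillation."""
    # Level 0: the student's own training objective against the ground truth.
    task = mse(student_out, target)
    # Level 1: the student's prediction should match the teacher's prediction.
    out = mse(student_out, teacher_out)
    # Level 2: projected intermediate features should match the teacher's,
    # averaged over the matched layer pairs.
    feat = sum(
        mse(project(s, p), t)
        for s, t, p in zip(student_feats, teacher_feats, projections)
    ) / len(student_feats)
    return w_task * task + w_out * out + w_feat * feat


# Toy example: one matched layer, student width 2, teacher width 3.
loss = multi_level_distill_loss(
    student_out=[1.0, 2.0],
    teacher_out=[1.0, 4.0],
    target=[0.0, 2.0],
    student_feats=[[1.0, 1.0]],
    teacher_feats=[[2.0, 0.0, 2.0]],
    projections=[[[1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]],
)
```

In a real training setup the vectors would be tensors, the projections would be learned jointly with the student, and the teacher would be frozen. The sketch only shows how supervision at several levels can be summed into one scalar objective.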
