SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

December 12, 2024
Authors: Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S.-H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren
cs.AI

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters while being significantly smaller (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
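
The abstract describes a training recipe that combines multi-level knowledge distillation from a much larger teacher with an adversarial guidance term for few-step generation. The minimal PyTorch sketch below illustrates how such a combined objective could be wired together; the module interfaces (student/teacher returning a prediction plus intermediate features, a text-conditioned discriminator), the flow-matching parameterization, and the loss weights are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: denoising loss + multi-level knowledge distillation
# + adversarial guidance, as outlined in the abstract. All module names and
# hyperparameters here are assumptions for illustration only.

import torch
import torch.nn.functional as F


def distillation_losses(student_out, teacher_out, student_feats, teacher_feats):
    """Output-level and feature-level (multi-level) distillation terms."""
    # Match the teacher's denoising prediction.
    out_loss = F.mse_loss(student_out, teacher_out)
    # Match selected intermediate features; in practice a learned projection
    # would align channel dimensions across the two architectures (omitted).
    feat_loss = sum(
        F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats)
    ) / len(student_feats)
    return out_loss, feat_loss


def training_step(student, teacher, discriminator, x0, text_emb, optimizer,
                  w_feat=0.5, w_adv=0.1):
    """One step combining denoising, multi-level KD, and adversarial guidance."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)            # continuous timestep
    t4 = t.view(-1, 1, 1, 1)
    xt = (1 - t4) * x0 + t4 * noise                          # linear interpolation

    with torch.no_grad():
        teacher_out, teacher_feats = teacher(xt, t, text_emb)

    student_out, student_feats = student(xt, t, text_emb)

    # Flow-matching style velocity target (assumed parameterization).
    denoise_loss = F.mse_loss(student_out, noise - x0)
    out_kd, feat_kd = distillation_losses(
        student_out, teacher_out, student_feats, teacher_feats
    )

    # Adversarial guidance: score the student's one-step reconstruction of x0
    # with a text-conditioned discriminator (non-saturating generator loss).
    x0_pred = xt - t4 * student_out
    adv_loss = -discriminator(x0_pred, text_emb).mean()

    loss = denoise_loss + out_kd + w_feat * feat_kd + w_adv * adv_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```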
