
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

December 12, 2024
Authors: Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, Aarush Gupta, Anujraaj Goyal, Dishani Lahiri, Rajesh Singh, Yerlan Idelbayev, Junli Cao, Yanyu Li, Kwang-Ting Cheng, S.-H. Gary Chan, Mingming Gong, Sergey Tulyakov, Anil Kag, Yanwu Xu, Jian Ren
cs.AI

Abstract

Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256x256 px generation. On T2I benchmarks (e.g., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x smaller than IF-XL).
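To make the two training ideas in the abstract concrete, below is a minimal PyTorch-style sketch of multi-level knowledge distillation (matching both the teacher's denoising output and selected intermediate features) combined with an adversarial term for few-step generation. All names, layer choices, and loss weights here (e.g., `proj_layers`, `w_feat`, the discriminator interface) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch, assuming a diffusion teacher/student pair whose
# intermediate activations are available as lists of tensors.
import torch
import torch.nn.functional as F


def multi_level_distillation_loss(student_out, teacher_out,
                                  student_feats, teacher_feats,
                                  proj_layers, w_feat=0.5):
    """Output-level + feature-level (multi-level) distillation.

    proj_layers: learned 1x1 convs mapping student channels to teacher
    channels, since a cross-architecture student generally has different
    widths and resolutions than the teacher.
    """
    # Match the denoising (noise/velocity) predictions directly.
    loss = F.mse_loss(student_out, teacher_out)

    # Match selected intermediate features after projection and resizing.
    for s, t, proj in zip(student_feats, teacher_feats, proj_layers):
        s = proj(s)
        if s.shape[-2:] != t.shape[-2:]:
            s = F.interpolate(s, size=t.shape[-2:], mode="bilinear",
                              align_corners=False)
        loss = loss + w_feat * F.mse_loss(s, t)
    return loss


def adversarial_guidance_loss(discriminator, student_sample):
    """Non-saturating generator loss on the student's few-step sample.

    The discriminator's architecture and conditioning are unspecified
    here; this only illustrates how an adversarial term can be added on
    top of the distillation objective.
    """
    logits = discriminator(student_sample)
    return F.softplus(-logits).mean()


# Hypothetical combined objective for one training step:
# total_loss = multi_level_distillation_loss(...) \
#              + lambda_adv * adversarial_guidance_loss(...)
```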
