LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization
March 11, 2025
作者: Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, Ser-Nam Lim
cs.AI
Abstract
Recent advances in text-to-image generation have primarily relied on
extensive datasets and parameter-heavy architectures. These requirements
severely limit accessibility for researchers and practitioners who lack
substantial computational resources. In this paper, we introduce LightGen, an
efficient training paradigm for image generation models that uses knowledge
distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration
from the success of data KD techniques widely adopted in Multi-Modal Large
Language Models (MLLMs), LightGen distills knowledge from state-of-the-art
(SOTA) text-to-image models into a compact Masked Autoregressive (MAR)
architecture with only 0.7B parameters. Using a compact synthetic dataset of
just 2M high-quality images generated from varied captions, we demonstrate
that data diversity significantly outweighs data volume in determining model
performance. This strategy dramatically reduces computational demands, cutting
pre-training time from potentially thousands of GPU-days to merely 88
GPU-days. Furthermore, to address the inherent shortcomings of synthetic data,
particularly poor high-frequency detail and spatial inaccuracies, we integrate
DPO to refine image fidelity and positional accuracy.
Comprehensive experiments confirm that LightGen achieves image generation
quality comparable to SOTA models while significantly reducing computational
resources and expanding accessibility for resource-constrained environments.
Code is available at https://github.com/XianfengWu01/LightGen.
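
The "data KD" step the abstract describes can be pictured as a teacher model rendering a small but caption-diverse synthetic dataset for the compact student. Below is a minimal sketch assuming a Stable Diffusion XL teacher via the diffusers library; the teacher checkpoint, caption list, and output paths are illustrative placeholders, not the paper's actual 2M-image pipeline.

```python
# Sketch of data-style knowledge distillation: a SOTA teacher generates
# images from varied captions; the student is later trained on the pairs.
import os
import torch
from diffusers import StableDiffusionXLPipeline

teacher = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder teacher model
    torch_dtype=torch.float16,
).to("cuda")

captions = [
    "a red vintage car parked beside a snowy mountain road",
    "an oil painting of a lighthouse at dusk",
    # ... varied captions; the paper argues diversity outweighs raw volume
]

os.makedirs("synthetic_data", exist_ok=True)
for i, caption in enumerate(captions):
    image = teacher(prompt=caption, num_inference_steps=30).images[0]
    image.save(f"synthetic_data/{i:07d}.png")  # stored paired with its caption
```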
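For concreteness, the pairwise DPO objective (Rafailov et al., 2023) that the abstract refers to fits in a few lines of PyTorch. The per-image log-likelihood interface below (e.g., summed token log-probs under a MAR model) is an assumed illustration, not LightGen's actual API.

```python
# Minimal DPO loss sketch: given a preferred sample x_w and a dispreferred
# sample x_l for the same caption, push the policy's likelihood margin
# above that of a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise DPO loss over batches of preferred/dispreferred log-probs."""
    policy_margin = policy_logp_w - policy_logp_l  # log pi(x_w) - log pi(x_l)
    ref_margin = ref_logp_w - ref_logp_l           # same margin under reference
    # -log sigmoid(beta * (policy margin - reference margin)), averaged
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities (batch of 4 preference pairs):
batch = (torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
loss = dpo_loss(*batch)
print(loss.item())  # scalar; backpropagate through the policy log-probs
```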