How far can we go with ImageNet for Text-to-Image generation?
February 28, 2025
Authors: L. Degeorge, A. Ghosh, N. Dufour, D. Picard, V. Kalogeiton
cs.AI
Abstract
Recent text-to-image (T2I) generation models have achieved remarkable results
by training on billion-scale datasets, following a "bigger is better" paradigm
that prioritizes data quantity over quality. We challenge this established
paradigm by demonstrating that strategic data augmentation of small,
well-curated datasets can match or outperform models trained on massive
web-scraped collections. Using only ImageNet enhanced with well-designed text
and image augmentations, we achieve a +2 overall score over SD-XL on GenEval
and +5 on DPGBench while using just 1/10th the parameters and 1/1000th the
training images. Our results suggest that strategic data augmentation, rather
than massive datasets, could offer a more sustainable path forward for T2I
generation.
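The abstract does not spell out the augmentation pipeline, but the core idea of the text side can be sketched: ImageNet images come with bare class labels, so varied, caption-like training text must be synthesized from them. The sketch below is a hypothetical illustration of template-based caption augmentation; the template strings, function name, and use of `random.Random` are assumptions, not the paper's actual method (image-side augmentations such as cropping or flipping would be a separate step).

```python
import random

# Hypothetical sketch: turn a bare ImageNet class label into varied,
# caption-like training text via random templates. The templates and
# function names below are illustrative assumptions, not the paper's
# actual augmentation pipeline.
TEMPLATES = [
    "a photo of a {}",
    "a close-up photograph of a {}",
    "an image showing a {} in a natural setting",
]

def augment_caption(class_name: str, rng: random.Random) -> str:
    """Pick a random caption template and fill in the class label."""
    return rng.choice(TEMPLATES).format(class_name)

# Example: three augmented captions for one ImageNet class.
rng = random.Random(0)  # seeded for reproducibility
captions = [augment_caption("golden retriever", rng) for _ in range(3)]
```

Each training image would then be paired with a freshly sampled caption each epoch, so the model sees diverse text for the same visual content.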