SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

January 30, 2025
Authors: Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, Bingchen Liu, Daquan Zhou, Song Han
cs.AI

Abstract

This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: a depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: a block-importance analysis technique for efficiently compressing the model to arbitrary sizes with minimal quality loss. (3) Inference-Time Scaling: a repeated sampling strategy that trades computation for model capacity, enabling smaller models to match the quality of larger models at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on the GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.
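The depth-growth idea in (1) can be illustrated with a minimal sketch: new transformer blocks are appended to the pretrained stack with their residual branches zero-initialized, so the grown network initially computes the same function as the small model and training resumes from there rather than from scratch. The `Block` and `grow_depth` names below are illustrative (attention is omitted), not the actual SANA implementation.

```python
import torch.nn as nn

class Block(nn.Module):
    """Simplified stand-in for a transformer block: a pre-norm residual MLP branch."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ffn(self.norm(x))

def grow_depth(blocks: nn.ModuleList, n_new: int) -> nn.ModuleList:
    """Append n_new blocks whose residual branches are zero-initialized,
    so the grown model initially reproduces the pretrained shallow model."""
    dim = blocks[0].norm.normalized_shape[0]
    grown = list(blocks)
    for _ in range(n_new):
        blk = Block(dim)
        nn.init.zeros_(blk.ffn[-1].weight)  # zero the last projection ...
        nn.init.zeros_(blk.ffn[-1].bias)    # ... so the new block starts as an identity map
        grown.append(blk)
    return nn.ModuleList(grown)

# e.g. grow a 20-block stack to 60 blocks (a ~3x depth scale-up,
# in the spirit of the 1.6B -> 4.8B growth described above)
small = nn.ModuleList(Block(64) for _ in range(20))
large = grow_depth(small, n_new=40)
```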
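The depth pruning in (2) can be sketched similarly: score each block by how much it transforms its input, then keep only the most important ones. The metric below (one minus the cosine similarity between a block's input and output) is a common proxy assumed for illustration; the paper's exact criterion may differ, and in practice a short fine-tuning pass would follow to recover quality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def block_importance(blocks: nn.ModuleList, x: torch.Tensor) -> list[float]:
    """Score each block by how much it changes its input:
    1 - cosine similarity between the block's input and output."""
    scores = []
    for blk in blocks:
        y = blk(x)
        cos = F.cosine_similarity(x.flatten(1), y.flatten(1), dim=1).mean()
        scores.append(1.0 - cos.item())
        x = y  # feed the output forward to score the next block in context
    return scores

def prune_to_depth(blocks: nn.ModuleList, x: torch.Tensor, keep: int) -> nn.ModuleList:
    """Keep the `keep` highest-importance blocks, preserving their original order."""
    scores = block_importance(blocks, x)
    top = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)[:keep]
    return nn.ModuleList(blocks[i] for i in sorted(top))
```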
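Finally, the repeated sampling in (3) amounts to a best-of-N search: generate several candidates per prompt under different noise seeds and keep the one a verifier ranks highest, spending extra inference compute instead of extra parameters. In this sketch, `generate` and `score` are hypothetical callables standing in for the diffusion sampler and a text-image judge (e.g., a VLM- or CLIP-based scorer).

```python
import torch

def best_of_n(generate, score, prompt: str, n: int = 16):
    """Repeated-sampling inference scaling: draw n candidates with
    different seeds and return the one the verifier scores highest."""
    best_img, best_score = None, float("-inf")
    for seed in range(n):
        torch.manual_seed(seed)   # vary only the sampling noise
        img = generate(prompt)    # one full diffusion sampling run
        s = score(prompt, img)    # verifier-assigned text-image alignment score
        if s > best_score:
            best_img, best_score = img, s
    return best_img, best_score
```

Under this scheme, the 0.72 -> 0.80 GenEval improvement reported above comes purely from drawing and ranking more samples at inference time, with no change to the model weights.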
