Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Abstract

In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at https://github.com/czg1225/CoDe
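The drafter/refiner split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the scale schedule, the function names (`collaborative_decode`, `token_split`, `make_stub`), and the `draft_steps` parameter are assumptions made for illustration, and the two "models" are stubs rather than real networks. The sketch only shows how the multi-scale loop might be partitioned and how many tokens each model would then be responsible for.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed scale schedule (side length of each token map); illustrative only.
SCALES: Tuple[int, ...] = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

# A "model" is any callable mapping (previous token maps, side length)
# to a flat list of side * side tokens for the next scale.
Model = Callable[[List[List[int]], int], List[int]]


def collaborative_decode(drafter: Model, refiner: Model,
                         scales: Sequence[int],
                         draft_steps: int) -> List[List[int]]:
    """Run the first `draft_steps` scales with `drafter`, the rest with `refiner`."""
    if not 0 <= draft_steps <= len(scales):
        raise ValueError("draft_steps must be between 0 and len(scales)")
    token_maps: List[List[int]] = []
    for step, side in enumerate(scales):
        # Small (coarse) scales go to the large model, large (fine) scales
        # to the small model.
        model = drafter if step < draft_steps else refiner
        token_maps.append(model(token_maps, side))
    return token_maps


def token_split(scales: Sequence[int], draft_steps: int) -> Tuple[int, int]:
    """Return (tokens generated by the drafter, tokens generated by the refiner)."""
    counts = [side * side for side in scales]
    return sum(counts[:draft_steps]), sum(counts[draft_steps:])


def make_stub(tag: int) -> Model:
    """Stand-in model that fills a side x side map with `tag` (no real network)."""
    def model(previous: List[List[int]], side: int) -> List[int]:
        return [tag] * (side * side)
    return model


if __name__ == "__main__":
    maps = collaborative_decode(make_stub(0), make_stub(1), SCALES, draft_steps=6)
    print([m[0] for m in maps])    # which stub produced each scale
    print(token_split(SCALES, 6))  # (drafter tokens, refiner tokens)
```

Under this assumed schedule, most of the token sequence sits in the last few scales, so handing those scales to a smaller model is where the bulk of the compute saving would be expected to come from.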
