Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Abstract

In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at https://github.com/czg1225/CoDe
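The division of labor described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The scale schedule `SCALES`, the choice of six drafting steps, and the `drafter`/`refiner` callables are all hypothetical placeholders and do not come from the abstract. The sketch only shows how a coarse-to-fine loop might hand the early, small scales to a large model and the later, large scales to a small model, and how many tokens each side would then be responsible for.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Assumed scale schedule: side length of the token map at each step.
# Illustrative only; the abstract does not specify the schedule.
SCALES: Tuple[int, ...] = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

# Hypothetical model interface: (side_length, tokens_so_far) -> new token map.
Model = Callable[[int, List[Any]], Any]


def split_scales(
    scales: Sequence[int], draft_steps: int
) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Split the schedule into (drafter scales, refiner scales)."""
    if not 0 <= draft_steps <= len(scales):
        raise ValueError("draft_steps must lie between 0 and len(scales)")
    return tuple(scales[:draft_steps]), tuple(scales[draft_steps:])


def token_count(scales: Sequence[int]) -> int:
    """Total tokens over the given scales (each scale is a square map)."""
    return sum(side * side for side in scales)


def collaborative_decode(
    scales: Sequence[int], draft_steps: int, drafter: Model, refiner: Model
) -> List[Any]:
    """Run the coarse-to-fine loop, giving each scale to exactly one model."""
    history: List[Any] = []
    for step, side in enumerate(scales):
        # Early (small) scales go to the large drafter, the rest to the refiner.
        model = drafter if step < draft_steps else refiner
        history.append(model(side, history))
    return history


if __name__ == "__main__":
    draft, refine = split_scales(SCALES, draft_steps=6)  # 6 is an assumed value
    total = token_count(SCALES)
    print(f"drafter scales: {draft}, tokens: {token_count(draft)} of {total}")
    print(f"refiner scales: {refine}, tokens: {token_count(refine)} of {total}")

    # Stub models that only record which model handled which scale.
    trace = collaborative_decode(
        SCALES,
        6,
        drafter=lambda side, hist: ("large", side),
        refiner=lambda side, hist: ("small", side),
    )
    print(trace)
```

Under this assumed schedule, the last few scales hold most of the tokens, so moving them to a smaller model is where savings of the kind the abstract reports would be expected to come from. Lowering `draft_steps` shifts more of the sequence to the small model, which corresponds to the "drafting steps are further decreased" setting.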
