Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
November 26, 2024
Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
cs.AI
Abstract
In the rapidly advancing field of image generation, Visual Auto-Regressive
(VAR) modeling has garnered considerable attention for its innovative
next-scale prediction approach. This paradigm offers substantial improvements
in efficiency, scalability, and zero-shot generalization. Yet, the inherently
coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to
prohibitive memory consumption and computational redundancies. To address these
bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient
decoding strategy tailored for the VAR framework. CoDe capitalizes on two
critical observations: the substantially reduced parameter demands at larger
scales and the exclusive generation patterns across different scales. Based on
these insights, we partition the multi-scale inference process into a seamless
collaboration between a large model and a small model. The large model serves
as the 'drafter', specializing in generating low-frequency content at smaller
scales, while the smaller model serves as the 'refiner', solely focusing on
predicting high-frequency details at larger scales. This collaboration yields
remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x
speedup, slashes memory usage by around 50%, and preserves image quality with
only a negligible FID increase from 1.95 to 1.98. When drafting steps are
further decreased, CoDe can achieve an impressive 2.9x acceleration ratio,
reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while
preserving a commendable FID of 2.27. The code is available at
https://github.com/czg1225/CoDe.
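
The drafter/refiner split described in the abstract can be pictured as a simple scheduling loop over scales. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the scale schedule `SCALES`, the function names, and the stub models are all hypothetical, and real models would be neural networks conditioned on the token maps from earlier scales.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical schedule: side length of the token map at each decoding step.
# These values are an illustrative assumption, not taken from the abstract.
SCALES: Tuple[int, ...] = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

# A "model" here is any callable that maps (side length, earlier outputs)
# to the output for the current scale.
Predictor = Callable[[int, Sequence[str]], str]


def large_model(side: int, context: Sequence[str]) -> str:
    """Stub standing in for the large 'drafter' model."""
    return f"large@{side}x{side}"


def small_model(side: int, context: Sequence[str]) -> str:
    """Stub standing in for the small 'refiner' model."""
    return f"small@{side}x{side}"


def collaborative_decode(
    scales: Sequence[int],
    drafter: Predictor,
    refiner: Predictor,
    draft_steps: int,
) -> List[str]:
    """Run the first `draft_steps` scales with `drafter`, the rest with `refiner`."""
    if not 0 <= draft_steps <= len(scales):
        raise ValueError("draft_steps must lie between 0 and len(scales)")
    outputs: List[str] = []
    for step, side in enumerate(scales):
        # Small (coarse) scales go to the drafter, large (fine) scales to the refiner.
        model = drafter if step < draft_steps else refiner
        outputs.append(model(side, tuple(outputs)))
    return outputs


def tokens_per_model(scales: Sequence[int], draft_steps: int) -> Tuple[int, int]:
    """Count token positions handled by the drafter and by the refiner."""
    drafted = sum(side * side for side in scales[:draft_steps])
    refined = sum(side * side for side in scales[draft_steps:])
    return drafted, refined
```

Under this assumed schedule, calling `tokens_per_model(SCALES, 6)` shows that the last four scales account for most of the token positions, which gives some intuition for why handing the larger scales to a smaller model could reduce compute and memory. The number of drafting steps is the knob the abstract refers to when it reports the trade-off between speedup and FID.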