Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Abstract

In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet the inherently coarse-to-fine nature of VAR produces long token sequences, leading to prohibitive memory consumption and computational redundancy. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored to the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the small model serves as the 'refiner', focusing solely on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, cuts memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When the number of drafting steps is further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at https://github.com/czg1225/CoDe
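
The drafter/refiner split described above can be illustrated with a minimal sketch. This is not the released CoDe implementation: the scale schedule `SCALES`, the `ScaleModel` callable interface, the stub models, and the choice of six drafting steps are all assumptions made for illustration. It only shows how a coarse-to-fine loop might hand the small scales to a large model and the large scales to a small one, and how many tokens fall on each side of the split.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed coarse-to-fine schedule: side length of the token map at each scale.
# Illustrative values only; the abstract does not specify the schedule.
SCALES: Tuple[int, ...] = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

# Hypothetical interface: a model receives the token maps generated so far and
# the side length of the next scale, and returns side*side tokens for it.
ScaleModel = Callable[[List[List[int]], int], List[int]]


def collaborative_decode(
    drafter: ScaleModel,
    refiner: ScaleModel,
    scales: Sequence[int],
    num_draft_steps: int,
) -> Tuple[List[List[int]], List[str]]:
    """Run the first `num_draft_steps` scales with the drafter, the rest with the refiner."""
    if not 0 <= num_draft_steps <= len(scales):
        raise ValueError("num_draft_steps must lie between 0 and len(scales)")
    generated: List[List[int]] = []
    roles: List[str] = []
    for step, side in enumerate(scales):
        if step < num_draft_steps:
            model, role = drafter, "drafter"   # large model, small scales
        else:
            model, role = refiner, "refiner"   # small model, large scales
        tokens = model(generated, side)
        if len(tokens) != side * side:
            raise ValueError(f"expected {side * side} tokens at scale {side}")
        generated.append(tokens)
        roles.append(role)
    return generated, roles


def token_split(scales: Sequence[int], num_draft_steps: int) -> Tuple[int, int]:
    """Count how many tokens the drafter and the refiner each generate."""
    draft = sum(s * s for s in scales[:num_draft_steps])
    refine = sum(s * s for s in scales[num_draft_steps:])
    return draft, refine


def make_stub(token_id: int) -> ScaleModel:
    """Stand-in for a real model: fills each scale with one constant token id."""
    def model(prefix: List[List[int]], side: int) -> List[int]:
        return [token_id] * (side * side)
    return model


if __name__ == "__main__":
    maps, roles = collaborative_decode(make_stub(0), make_stub(1), SCALES, num_draft_steps=6)
    print(roles)
    print(token_split(SCALES, 6))
```

Under the assumed schedule with six drafting steps, the drafter produces 91 of the 680 tokens and the refiner produces the remaining 589. Most of the sequence therefore sits at the large scales, which is consistent with the abstract's argument that assigning those scales to a smaller model reduces cost.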
