HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
October 14, 2024
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
cs.AI
Abstract
We introduce Hybrid Autoregressive Transformer (HART), an autoregressive (AR)
visual generation model capable of directly generating 1024x1024 images,
rivaling diffusion models in image generation quality. Existing AR models face
limitations due to the poor image reconstruction quality of their discrete
tokenizers and the prohibitive training costs associated with generating 1024px
images. To address these challenges, we present the hybrid tokenizer, which
decomposes the continuous latents from the autoencoder into two components:
discrete tokens representing the big picture and continuous tokens representing
the residual components that cannot be represented by the discrete tokens. The
discrete component is modeled by a scalable-resolution discrete AR model, while
the continuous component is learned with a lightweight residual diffusion
module with only 37M parameters. Compared with the discrete-only VAR tokenizer,
our hybrid approach improves reconstruction FID from 2.11 to 0.30 on MJHQ-30K,
leading to a 31% generation FID improvement from 7.85 to 5.38. HART also
outperforms state-of-the-art diffusion models in both FID and CLIP score, with
4.5-7.7x higher throughput and 6.9-13.4x lower MACs. Our code is open sourced
at https://github.com/mit-han-lab/hart.
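To make the tokenizer decomposition above concrete, here is a minimal PyTorch sketch. The class name `HybridTokenizer`, the codebook size, and the latent shapes are illustrative assumptions, and plain nearest-neighbor vector quantization stands in for the multi-scale VAR-style quantization the paper builds on; this is not HART's released implementation.

```python
# Minimal sketch (illustrative, not HART's released code): continuous latents
# are split into discrete tokens plus a continuous residual that the
# discrete tokens cannot capture.
import torch
import torch.nn as nn

class HybridTokenizer(nn.Module):  # hypothetical name
    def __init__(self, codebook_size: int = 4096, dim: int = 32):
        super().__init__()
        # Codebook for nearest-neighbor vector quantization (plain VQ here;
        # the paper uses a multi-scale VAR-style tokenizer).
        self.codebook = nn.Embedding(codebook_size, dim)

    def quantize(self, z: torch.Tensor):
        # z: (B, N, dim) continuous latents from the autoencoder encoder.
        # Squared L2 distance from each latent to every codebook entry.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dists.argmin(dim=-1)   # (B, N) discrete tokens
        z_q = self.codebook(idx)     # (B, N, dim) quantized latents
        return idx, z_q

    def decompose(self, z: torch.Tensor):
        # Discrete tokens carry the "big picture"; the continuous residual
        # is everything the discrete tokens fail to represent.
        idx, z_q = self.quantize(z)
        return idx, z - z_q

tok = HybridTokenizer()
z = torch.randn(2, 1024, 32)            # dummy encoder output
idx, residual = tok.decompose(z)
z_recon = tok.codebook(idx) + residual  # recovers z up to rounding
assert torch.allclose(z_recon, z)
```

At generation time, per the abstract, the discrete tokens would come from the scalable-resolution AR transformer and the residual from the 37M-parameter residual diffusion module; the two components are summed back together before being passed to the autoencoder decoder.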