FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
October 17, 2024
Authors: ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, and Lei Bai
cs.AI
Abstract
Nature is infinitely resolution-free. In the context of this
reality, existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside of their trained domain.
To address this limitation, we conceptualize images as sequences of tokens with
dynamic sizes, rather than traditional methods that perceive images as
fixed-resolution grids. This perspective enables a flexible training strategy
that seamlessly accommodates various aspect ratios during both training and
inference, thus promoting resolution generalization and eliminating biases
introduced by image cropping. On this basis, we present the Flexible
Vision Transformer (FiT), a transformer architecture specifically designed for
generating images with unrestricted resolutions and aspect ratios. We
further upgrade FiT to FiTv2 with several innovative designs, including
Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow
scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted
network structure, FiTv2 exhibits 2× the convergence speed of FiT. When
incorporating advanced training-free extrapolation techniques, FiTv2
demonstrates remarkable adaptability in both resolution extrapolation and
diverse resolution generation. Additionally, our exploration of the scalability
of the FiTv2 model reveals that larger models exhibit better computational
efficiency. Furthermore, we introduce an efficient post-training strategy to
adapt a pre-trained model for high-resolution generation. Comprehensive
experiments demonstrate the exceptional performance of FiTv2 across a broad
range of resolutions. We have released all the codes and models at
https://github.com/whlzy/FiT to promote the exploration of diffusion
transformer models for arbitrary-resolution image generation.
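To make the "images as token sequences with dynamic sizes" idea concrete, here is a minimal sketch of how images with different aspect ratios can be patchified into variable-length token sequences and packed into one padded batch with an attention mask. The shapes, function names, and padding scheme are illustrative assumptions, not the actual FiT/FiTv2 implementation.

```python
import numpy as np

def patchify(image: np.ndarray, p: int = 2) -> np.ndarray:
    """Split a (C, H, W) latent into a ((H//p)*(W//p), C*p*p) token sequence."""
    c, h, w = image.shape
    assert h % p == 0 and w % p == 0
    tokens = image.reshape(c, h // p, p, w // p, p)
    # Reorder to (H//p, W//p, C, p, p), then flatten patches into tokens.
    tokens = tokens.transpose(1, 3, 0, 2, 4).reshape((h // p) * (w // p), c * p * p)
    return tokens

def pad_batch(images, p: int = 2, max_len: int = 256):
    """Pack latents of different resolutions into one padded batch plus mask."""
    seqs = [patchify(img, p) for img in images]
    dim = seqs[0].shape[1]
    batch = np.zeros((len(seqs), max_len, dim))
    mask = np.zeros((len(seqs), max_len), dtype=bool)  # True = real token
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = s
        mask[i, : len(s)] = True
    return batch, mask

# A 16x16 and a 32x8 latent (same area, different aspect ratios) share a batch:
imgs = [np.random.randn(4, 16, 16), np.random.randn(4, 32, 8)]
batch, mask = pad_batch(imgs)
print(batch.shape, mask.sum(axis=1))  # (2, 256, 16) [64 64]
```

Because the transformer attends only where the mask is true, neither image needs cropping or resizing to a fixed grid, which is the training bias the abstract says this formulation removes.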
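The abstract also mentions a rectified flow scheduler paired with a Logit-Normal timestep sampler. A common form of that sampler draws a Gaussian variable and maps it through a sigmoid, concentrating training timesteps at mid-noise levels; the location/scale values below are illustrative defaults, not FiTv2's actual hyperparameters.

```python
import numpy as np

def sample_timesteps(n: int, m: float = 0.0, s: float = 1.0,
                     rng=None) -> np.ndarray:
    """Draw n logit-normally distributed timesteps in (0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(loc=m, scale=s, size=n)
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid of a normal -> logit-normal

# Rectified flow linearly interpolates x_t = (1 - t) * x0 + t * noise and
# regresses the constant velocity (noise - x0); t comes from the sampler.
t = sample_timesteps(10_000, rng=np.random.default_rng(0))
print(t.min() > 0.0, t.max() < 1.0)
```

With m = 0 the distribution is symmetric around t = 0.5, so samples cluster where the velocity target is hardest to predict, which is the usual motivation for this choice over uniform timestep sampling.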