
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

October 17, 2024
Authors: Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian
cs.AI

Abstract

Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores than raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. The Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K and an overall score of 0.69 on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.
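To make the two design axes concrete, below is a minimal, illustrative sketch (not the authors' implementation) of training losses at the two extremes the abstract compares: raster-order prediction of discrete tokens with a causal, GPT-like transformer, and random-order prediction of continuous tokens with a bidirectional, BERT-like transformer. All names, shapes, and the 50% masking ratio are hypothetical, and the continuous-token objective is simplified to plain L2 regression for brevity; the paper's actual loss on continuous tokens is more sophisticated.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """Shared backbone: a causal mask makes it GPT-like, no mask makes it BERT-like."""
    def __init__(self, dim=64, heads=4, layers=2, causal=False):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.causal = causal

    def forward(self, x):
        mask = None
        if self.causal:
            n = x.size(1)
            # True above the diagonal = position i may not attend to j > i.
            mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        return self.encoder(x, mask=mask)

def raster_discrete_loss(backbone, head, embed, tokens):
    """GPT-like next-token classification over discrete (VQ) token indices."""
    h = backbone(embed(tokens[:, :-1]))                  # causal context only
    logits = head(h)                                     # (B, N-1, vocab)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))

def random_continuous_loss(backbone, head, mask_token, latents):
    """BERT-like masked prediction of continuous (VAE) tokens at random positions."""
    B, N, D = latents.shape
    predict = torch.rand(B, N) < 0.5                     # random subset to predict
    x = torch.where(predict.unsqueeze(-1), mask_token.expand(B, N, D), latents)
    h = backbone(x)                                      # bidirectional attention
    pred = head(h)                                       # (B, N, D)
    return ((pred - latents) ** 2)[predict].mean()       # L2 stand-in for the real loss

# Smoke test with toy shapes.
B, N, D, vocab = 2, 16, 64, 1024
gpt = TinyTransformer(dim=D, causal=True)
bert = TinyTransformer(dim=D, causal=False)
print(raster_discrete_loss(gpt, nn.Linear(D, vocab), nn.Embedding(vocab, D),
                           torch.randint(vocab, (B, N))))
print(random_continuous_loss(bert, nn.Linear(D, D), torch.zeros(D),
                             torch.randn(B, N, D)))
```

Note the structural difference the abstract ties to GenEval: the random-order model attends bidirectionally to every unmasked position when predicting a token, whereas the raster-order model only ever sees the prefix.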
