Distilling semantically aware orders for autoregressive image generation
April 23, 2025
Authors: Rishav Pramanik, Antoine Poupon, Juan A. Rodriguez, Masih Aminbeidokhti, David Vazquez, Christopher Pal, Zhaozheng Yin, Marco Pedersoli
cs.AI
Abstract
Autoregressive patch-based image generation has recently shown competitive
results in terms of image quality and scalability. It can also be easily
integrated and scaled within Vision-Language models. Nevertheless,
autoregressive models require a defined order for patch generation. While the
natural order in which words are written makes sense for text generation, image
generation has no inherent generation order. Traditionally, a raster-scan order
(from top-left to bottom-right)
guides autoregressive image generation models. In this paper, we argue that
this order is suboptimal, as it fails to respect the causality of the image
content: for instance, when conditioned on a visual description of a sunset, an
autoregressive model may generate clouds before the sun, even though the color
of clouds should depend on the color of the sun and not the inverse. In this
work, we first show that by training a model to generate patches in any given
order, we can infer both the content and the location (order) of each patch
during generation. Second, we use these extracted orders to fine-tune the
any-given-order model to produce better-quality images. Through our
experiments, we show on two datasets that this new generation method produces
better images than the traditional raster-scan approach, with similar training
costs and no extra annotations.
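
To make the notion of a generation order concrete, the following is a minimal sketch, not the authors' implementation. It assumes an image split into a small grid of patches indexed row by row, and it only contrasts a raster-scan order with an arbitrary permutation; the learned, content-dependent orders described in the abstract are not modeled here. The function names (`raster_order`, `random_order`, `generation_steps`) are hypothetical and chosen for illustration.

```python
import random


def raster_order(height, width):
    """Patch indices in raster-scan order (top-left to bottom-right)."""
    return [row * width + col for row in range(height) for col in range(width)]


def random_order(height, width, seed=0):
    """One arbitrary permutation of the patch indices (an illustrative stand-in
    for 'any given order'; it is not a learned order)."""
    order = raster_order(height, width)
    random.Random(seed).shuffle(order)
    return order


def generation_steps(order, width):
    """For each step, pair the (row, col) of the patch being generated with the
    (row, col) positions already generated, which it may condition on."""
    steps = []
    for t, idx in enumerate(order):
        target = divmod(idx, width)                      # patch generated at step t
        context = [divmod(j, width) for j in order[:t]]  # patches generated before it
        steps.append((target, context))
    return steps


if __name__ == "__main__":
    for target, context in generation_steps(raster_order(2, 3), width=3):
        print(target, "conditioned on", context)
```

Under a raster-scan order, the context of each patch is fixed by its position alone; with a different permutation, the same patch can be conditioned on a different set of previously generated patches, which is the degree of freedom the abstract argues should follow image content.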