Autoregressive Models in Vision: A Survey

November 8, 2024
Authors: Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
cs.AI

Abstract

Autoregressive modeling has been a huge success in natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel at producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. In computer vision, however, the representation strategy can vary across levels (pixel-level, token-level, or scale-level), reflecting the diverse and hierarchical nature of visual data compared with the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories according to their representation strategy: pixel-based, token-based, and scale-based models. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, covering image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging ones such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges facing autoregressive models in vision and suggest potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.

November 13, 2024