Autoregressive Models in Vision: A Survey

November 8, 2024
Authors: Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
cs.AI

Abstract

Autoregressive modeling has been highly successful in natural language processing (NLP). Recently, autoregressive models have become a significant focus in computer vision, where they excel at producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. In computer vision, however, the representation strategy can vary across levels (pixel-level, token-level, or scale-level), reflecting the diverse and hierarchical nature of visual data in contrast to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories based on representation strategy: pixel-based, token-based, and scale-based models. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, covering image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging ones such as embodied AI and 3D medical AI, drawing on about 250 related references. Finally, we highlight the current challenges facing autoregressive models in vision and suggest potential research directions. We have also set up a GitHub repository to organize the papers included in this survey: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.
