Autoregressive Models in Vision: A Survey

Abstract

Autoregressive modeling has been highly successful in natural language processing (NLP). Recently, autoregressive models have become a significant area of focus in computer vision, where they excel at producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. In computer vision, however, the representation strategy can vary across levels, namely pixel level, token level, or scale level, reflecting the diverse and hierarchical nature of visual data compared with the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories according to their representation strategy: pixel-based, token-based, and scale-based models. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, covering image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges facing autoregressive models in vision and suggest potential research directions. We have also set up a GitHub repository to organize the papers included in this survey: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.
