An Empirical Study of Autoregressive Pre-training from Videos

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks, including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in scaling curves similar to those seen in language models, albeit at a different rate. More details are available at https://brjathu.github.io/toto/

