An Empirical Study of Autoregressive Pre-training from Videos
January 9, 2025
Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik
cs.AI
Abstract
We empirically study autoregressive pre-training from videos. To perform our
study, we construct a series of autoregressive video models, called Toto. We
treat videos as sequences of visual tokens and train transformer models to
autoregressively predict future tokens. Our models are pre-trained on a diverse
dataset of videos and images comprising over 1 trillion visual tokens. We
explore different architectural, training, and inference design choices. We
evaluate the learned visual representations on a range of downstream tasks
including image recognition, video classification, object tracking, and
robotics. Our results demonstrate that, despite minimal inductive biases,
autoregressive pre-training leads to competitive performance across all
benchmarks. Finally, we find that scaling our video models results in similar
scaling curves to those seen in language models, albeit with a different rate.
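The abstract's core recipe is to treat a video as one long sequence of discrete visual tokens and train the model to predict each next token from the tokens before it. A minimal sketch of that sequence framing is below; the function name, the toy token IDs, and the raster-order flattening are illustrative assumptions, not the paper's actual tokenizer or training code.

```python
# Hypothetical sketch of autoregressive framing for video tokens.
# Assumes some tokenizer has already mapped each frame to discrete
# token IDs; frames are flattened and concatenated into one sequence,
# and the model must predict token t+1 from tokens 0..t.

def make_autoregressive_pairs(token_seq):
    """Shift a token sequence to build (input, target) training pairs."""
    inputs = token_seq[:-1]   # model sees tokens 0..T-1
    targets = token_seq[1:]   # and is trained to predict tokens 1..T
    return inputs, targets

# Toy example: 2 frames, 4 tokens each, flattened in raster order.
frames = [[11, 12, 13, 14], [21, 22, 23, 24]]
sequence = [tok for frame in frames for tok in frame]
x, y = make_autoregressive_pairs(sequence)
# x and y are offset by one position: y[i] is the token after x[i].
```

A transformer trained on such pairs with a standard next-token cross-entropy loss is the setup the abstract describes scaling to over a trillion tokens.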
More details at https://brjathu.github.io/toto/