An Empirical Study of Autoregressive Pre-training from Videos

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in scaling curves similar to those seen in language models, albeit at a different rate. More details at https://brjathu.github.io/toto/
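
To illustrate the training setup described above, in which each visual token is predicted from the tokens that precede it, here is a minimal sketch in Python. It assumes an average next-token cross-entropy loss, the usual objective for autoregressive models, which this abstract does not spell out. The names `next_token_loss`, `uniform_model`, and `VOCAB_SIZE`, as well as the token values, are hypothetical and chosen only for illustration; a real setup would use a transformer and a learned visual tokenizer.

```python
import math
from typing import Callable, List, Sequence


def next_token_loss(
    tokens: Sequence[int],
    predict_logits: Callable[[Sequence[int]], List[float]],
) -> float:
    """Average cross-entropy of predicting tokens[i] from the prefix tokens[:i]."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for i in range(1, len(tokens)):
        # The model only sees tokens before position i (causal prediction).
        logits = predict_logits(tokens[:i])
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        # Negative log-probability assigned to the true next token.
        total += log_z - logits[tokens[i]]
    return total / (len(tokens) - 1)


VOCAB_SIZE = 8  # hypothetical size of the visual token vocabulary


def uniform_model(prefix: Sequence[int]) -> List[float]:
    """Stand-in for a transformer: assigns equal logits to every token."""
    return [0.0] * VOCAB_SIZE


# Hypothetical token ids, as if produced by tokenizing video frames.
video_tokens = [3, 1, 4, 1, 5, 2, 6, 5]
loss = next_token_loss(video_tokens, uniform_model)
print(f"next-token loss: {loss:.4f}")
```

With the uniform stand-in model, the loss equals the natural log of the vocabulary size, which is the baseline that any trained model should improve on.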
