An Empirical Study of Autoregressive Pre-training from Videos
January 9, 2025
Authors: Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik
cs.AI
Abstract
We empirically study autoregressive pre-training from videos. To perform our
study, we construct a series of autoregressive video models, called Toto. We
treat videos as sequences of visual tokens and train transformer models to
autoregressively predict future tokens. Our models are pre-trained on a diverse
dataset of videos and images comprising over 1 trillion visual tokens. We
explore different architectural, training, and inference design choices. We
evaluate the learned visual representations on a range of downstream tasks
including image recognition, video classification, object tracking, and
robotics. Our results demonstrate that, despite minimal inductive biases,
autoregressive pre-training leads to competitive performance across all
benchmarks. Finally, we find that scaling our video models results in similar
scaling curves to those seen in language models, albeit with a different rate.
More details at https://brjathu.github.io/toto/.
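The core recipe the abstract describes, tokenize each frame into a grid of discrete visual tokens, flatten the grids into one long sequence, and train a transformer to predict the next token, can be sketched as follows. This is a minimal illustration, not the authors' code: the shapes, vocabulary size, and the uniform stand-in for the model's softmax are hypothetical choices for demonstration.

```python
import numpy as np

# Hypothetical toy setup: a "video" of T frames, each quantized by a
# tokenizer into an H x W grid of discrete token IDs from a small vocab.
T, H, W, vocab = 4, 2, 2, 16
rng = np.random.default_rng(0)
video_tokens = rng.integers(0, vocab, size=(T, H, W))

# Flatten frames in raster order into a single 1D token sequence, so the
# transformer sees the video the way a language model sees text.
seq = video_tokens.reshape(-1)  # length T * H * W = 16

# Autoregressive pairs: predict token t+1 from tokens up to t.
inputs, targets = seq[:-1], seq[1:]

# Next-token cross-entropy under a uniform distribution, standing in for
# the transformer's predicted softmax; a real model would output logits.
probs = np.full((len(inputs), vocab), 1.0 / vocab)
loss = -np.log(probs[np.arange(len(targets)), targets]).mean()
print(round(loss, 4))  # uniform model: ln(16) ≈ 2.7726
```

Downstream evaluation then reads representations out of the trained transformer's intermediate activations rather than its token predictions.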