Learning Video Representations without Natural Videos
October 31, 2024
Authors: Xueyang Yu, Xinlei Chen, Yossi Gandelsman
cs.AI
Abstract
In this paper, we show that useful video representations can be learned from
synthetic videos and natural images, without incorporating natural videos in
the training. We propose a progression of video datasets synthesized by simple
generative processes that model a growing set of natural video properties
(e.g., motion, acceleration, and shape transformations). The downstream
performance of video models pre-trained on these generated datasets gradually
increases with the dataset progression. A VideoMAE model pre-trained on our
synthetic videos closes 97.2% of the performance gap on UCF101 action
classification between training from scratch and self-supervised pre-training
from natural videos, and outperforms the pre-trained model on HMDB51.
Introducing crops of static images to the pre-training stage results in similar
performance to UCF101 pre-training and outperforms the UCF101 pre-trained model
on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the
low-level properties of the datasets, we identify correlations between frame
diversity, frame similarity to natural data, and downstream performance. Our
approach provides a more controllable and transparent alternative to video data
curation processes for pre-training.
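
The "simple generative processes" described above suggest clips that can be rendered entirely procedurally. Below is a minimal NumPy sketch of one such process, covering motion, acceleration, and a crude shape transformation. The function name `synthesize_clip` and all parameter ranges are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def synthesize_clip(num_frames=16, size=64, rng=None):
    """Toy clip of one square that translates with constant acceleration
    and grows over time (a crude 'shape transformation').
    Hypothetical stand-in for the paper's generative processes."""
    rng = rng if rng is not None else np.random.default_rng()
    pos = rng.uniform(12, size - 12, size=2)   # initial (x, y) centre
    vel = rng.uniform(-2.0, 2.0, size=2)       # pixels per frame
    acc = rng.uniform(-0.2, 0.2, size=2)       # velocity change per frame
    half = float(rng.integers(2, 5))           # initial half side length
    grow = rng.uniform(0.0, 0.2)               # per-frame size increase
    frames = np.zeros((num_frames, size, size), dtype=np.float32)
    for t in range(num_frames):
        h = int(half + grow * t)
        # Keep the square inside the canvas, then rasterize it.
        x, y = np.clip(pos, h, size - h - 1).astype(int)
        frames[t, y - h:y + h + 1, x - h:x + h + 1] = 1.0
        vel += acc
        pos = pos + vel
    return frames
```

A dataset progression in the spirit of the abstract could then be built by enabling these properties one at a time (e.g., zero acceleration and fixed size for the earliest dataset).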
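One plausible reading of "introducing crops of static images to the pre-training stage" is turning each image into a pseudo-video by sliding a crop window across it. The sketch below, with the hypothetical helper `image_to_clip`, illustrates that idea only; the paper's exact cropping scheme is not specified here.

```python
import numpy as np

def image_to_clip(image, num_frames=16, crop=64, rng=None):
    """Build a pseudo-video from one static image by interpolating a
    crop window between two random locations. Assumes the image is at
    least `crop` pixels in both spatial dimensions."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    y0, y1 = rng.integers(0, h - crop + 1, size=2)
    x0, x1 = rng.integers(0, w - crop + 1, size=2)
    ys = np.linspace(y0, y1, num_frames).round().astype(int)
    xs = np.linspace(x0, x1, num_frames).round().astype(int)
    return np.stack([image[y:y + crop, x:x + crop] for y, x in zip(ys, xs)])
```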
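The low-level properties named at the end of the abstract, frame diversity and frame similarity to natural data, can be approximated with simple distance-based proxies. The two hypothetical measures below are one way to compute such statistics; the paper's exact metrics may differ.

```python
import numpy as np

def frame_diversity(clip):
    """Mean pairwise L2 distance between a clip's frames; a simple
    proxy for how much the frames differ from each other."""
    flat = clip.reshape(len(clip), -1).astype(np.float64)
    diffs = flat[:, None, :] - flat[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(-1)).mean())

def distance_to_natural(frames, natural_frames):
    """Mean nearest-neighbour L2 distance from each synthetic frame to
    a bank of natural frames (lower = more similar to natural data)."""
    s = frames.reshape(len(frames), -1).astype(np.float64)
    n = natural_frames.reshape(len(natural_frames), -1).astype(np.float64)
    d = np.sqrt(((s[:, None, :] - n[None, :, :]) ** 2).sum(-1))
    return float(d.min(axis=1).mean())
```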