Learning Video Representations without Natural Videos
October 31, 2024
Authors: Xueyang Yu, Xinlei Chen, Yossi Gandelsman
cs.AI
Abstract
In this paper, we show that useful video representations can be learned from
synthetic videos and natural images, without incorporating natural videos in
the training. We propose a progression of video datasets synthesized by simple
generative processes, that model a growing set of natural video properties
(e.g. motion, acceleration, and shape transformations). The downstream
performance of video models pre-trained on these generated datasets gradually
increases with the dataset progression. A VideoMAE model pre-trained on our
synthetic videos closes 97.2% of the performance gap on UCF101 action
classification between training from scratch and self-supervised pre-training
from natural videos, and outperforms the pre-trained model on HMDB51.
Introducing crops of static images to the pre-training stage results in similar
performance to UCF101 pre-training and outperforms the UCF101 pre-trained model
on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the
low-level properties of the datasets, we identify correlations between frame
diversity, frame similarity to natural data, and downstream performance. Our
approach provides a more controllable and transparent alternative to video data
curation processes for pre-training.
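As a rough illustration of the kind of "simple generative process" the abstract describes, the sketch below synthesizes short clips of circles moving under constant acceleration. This is a hedged toy example, not the authors' released pipeline: the function name `synthesize_clip`, its parameters, and the choice of circles as shapes are all assumptions for illustration, and the paper's actual dataset progression additionally models properties such as shape transformations that are not reproduced here.

```python
# Hypothetical sketch of a simple synthetic-video generator (not the paper's code):
# each clip contains a few circles whose positions follow basic kinematics
# (position updated by velocity, velocity updated by acceleration).
import numpy as np


def synthesize_clip(num_frames=16, size=64, num_shapes=3, seed=0):
    """Return a (num_frames, size, size) float32 array of moving circles."""
    rng = np.random.default_rng(seed)
    # Per-shape initial position, velocity, and acceleration (in pixels per frame).
    pos = rng.uniform(8, size - 8, (num_shapes, 2))
    vel = rng.uniform(-2.0, 2.0, (num_shapes, 2))
    acc = rng.uniform(-0.2, 0.2, (num_shapes, 2))
    radius = rng.uniform(4, 10, num_shapes)
    intensity = rng.uniform(0.3, 1.0, num_shapes)

    yy, xx = np.mgrid[0:size, 0:size]
    clip = np.zeros((num_frames, size, size), dtype=np.float32)
    for t in range(num_frames):
        frame = np.zeros((size, size), dtype=np.float32)
        for s in range(num_shapes):
            # Rasterize circle s at its current position.
            mask = (yy - pos[s, 0]) ** 2 + (xx - pos[s, 1]) ** 2 <= radius[s] ** 2
            frame[mask] = intensity[s]
        clip[t] = frame
        # Simple kinematics: velocity changes by acceleration, position by velocity.
        vel += acc
        pos = (pos + vel) % size  # wrap around the frame boundary
    return clip


if __name__ == "__main__":
    clip = synthesize_clip()
    print(clip.shape)  # (16, 64, 64)
```

Clips generated this way could, in principle, be fed to a masked-autoencoding video model such as VideoMAE in place of natural videos; how the paper schedules the progression of dataset properties is described in the full text rather than in this sketch.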