TRecViT: A Recurrent Video Transformer
December 18, 2024
作者: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
cs.AI
Abstract
We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks and can be trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
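
The factorisation described above maps naturally onto three axis-wise operators applied to a (time, spatial tokens, channels) grid. Below is a minimal, self-contained JAX sketch of one such block: a simplified gated linear recurrence for time mixing, single-head self-attention for space mixing, and a two-layer MLP for channel mixing. The function names, the exact gating form, and the omission of normalisation layers are illustrative assumptions, not the authors' implementation; see the repository above for the official code.

```python
# Illustrative sketch only: a simplified time-space-channel factorised block.
import jax
import jax.numpy as jnp


def gated_linear_recurrence(x, gate_w, gate_b):
    """Simplified gated LRU over time: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    x: (T, D) sequence for one spatial token; gate_w, gate_b: (D,) gate params.
    """
    a = jax.nn.sigmoid(x * gate_w + gate_b)  # per-step, per-channel gates in (0, 1)

    def step(h_prev, inputs):
        a_t, x_t = inputs
        h_t = a_t * h_prev + (1.0 - a_t) * x_t  # causal: depends only on the past
        return h_t, h_t

    _, h = jax.lax.scan(step, jnp.zeros(x.shape[-1]), (a, x))
    return h


def self_attention(x, wq, wk, wv):
    """Single-head self-attention over the spatial token axis. x: (N, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    return attn @ v


def mlp(x, w1, w2):
    """Channel mixing with a two-layer MLP. x: (..., D)."""
    return jax.nn.gelu(x @ w1) @ w2


def trecvit_block(tokens, params):
    """tokens: (T, N, D) = (time, spatial tokens, channels)."""
    # 1) Time mixing: gated recurrence run independently per spatial token.
    tokens = tokens + jax.vmap(
        lambda seq: gated_linear_recurrence(seq, params['gate_w'], params['gate_b']),
        in_axes=1, out_axes=1)(tokens)
    # 2) Space mixing: self-attention within each frame, per time step.
    tokens = tokens + jax.vmap(
        lambda frame: self_attention(frame, params['wq'], params['wk'], params['wv']))(tokens)
    # 3) Channel mixing: position-wise MLP.
    tokens = tokens + mlp(tokens, params['w1'], params['w2'])
    return tokens


# Usage: 8 frames, 16 spatial tokens, 64 channels.
key = jax.random.PRNGKey(0)
D = 64
params = {
    'gate_w': jnp.ones(D) * 0.1, 'gate_b': jnp.zeros(D),
    'wq': jax.random.normal(key, (D, D)) * 0.02,
    'wk': jax.random.normal(key, (D, D)) * 0.02,
    'wv': jax.random.normal(key, (D, D)) * 0.02,
    'w1': jax.random.normal(key, (D, 4 * D)) * 0.02,
    'w2': jax.random.normal(key, (4 * D, D)) * 0.02,
}
x = jax.random.normal(key, (8, 16, D))
print(trecvit_block(x, params).shape)  # (8, 16, 64)
```

Because the time-mixing step is a causal recurrence rather than attention, each frame's tokens depend only on past frames, which is what makes the overall model causal and keeps its memory footprint small at inference time.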