TRecViT: A Recurrent Video Transformer

December 18, 2024
Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
cs.AI

Abstract

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs mix over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOPs count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
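
To make the factorisation concrete, below is a minimal sketch (in JAX) of one TRecViT-style block: a causal gated linear recurrence mixes each spatial token's features over time, single-head self-attention mixes tokens within each frame, and a position-wise MLP mixes channels. This is not the authors' released code; the parameter names, the single-head attention, the residual wiring, and the exact gating form of the LRU are illustrative assumptions based only on the abstract.

```python
import jax
import jax.numpy as jnp

def gated_lru_scan(x, a, gate_w):
    # x: (T, D) features of one spatial token over time; a: (D,) decay in (0, 1).
    # Causal linear recurrence h_t = a * h_{t-1} + (1 - a) * g_t * x_t,
    # with an input-dependent sigmoid gate g_t (one plausible gating; a simplification).
    g = jax.nn.sigmoid(x @ gate_w)
    def step(h, inputs):
        xt, gt = inputs
        h = a * h + (1.0 - a) * gt * xt
        return h, h
    _, hs = jax.lax.scan(step, jnp.zeros(x.shape[-1]), (x, g))
    return hs  # (T, D), causal in time

def self_attention(x, wq, wk, wv):
    # Single-head attention over the token axis of one frame. x: (N, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    att = jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)
    return att @ v

def mlp(x, w1, w2):
    # Channel mixing with a GELU MLP. x: (..., D).
    return jax.nn.gelu(x @ w1) @ w2

def trecvit_block(x, params):
    # x: (T, N, D) = (time, space tokens, channels).
    # 1) Time mixing: run the gated LRU independently for each spatial token.
    x = x + jax.vmap(gated_lru_scan, in_axes=(1, None, None), out_axes=1)(
        x, params['a'], params['gate_w'])
    # 2) Space mixing: self-attention within each frame, applied per time step.
    x = x + jax.vmap(self_attention, in_axes=(0, None, None, None))(
        x, params['wq'], params['wk'], params['wv'])
    # 3) Channel mixing: position-wise MLP.
    x = x + mlp(x, params['w1'], params['w2'])
    return x

# Tiny usage example with random weights (hypothetical sizes).
key = jax.random.PRNGKey(0)
T, N, D = 8, 16, 32
ks = jax.random.split(key, 8)
params = {
    'a': jax.nn.sigmoid(jax.random.normal(ks[0], (D,))),  # decays in (0, 1)
    'gate_w': jax.random.normal(ks[1], (D, D)) / jnp.sqrt(D),
    'wq': jax.random.normal(ks[2], (D, D)) / jnp.sqrt(D),
    'wk': jax.random.normal(ks[3], (D, D)) / jnp.sqrt(D),
    'wv': jax.random.normal(ks[4], (D, D)) / jnp.sqrt(D),
    'w1': jax.random.normal(ks[5], (D, 4 * D)) / jnp.sqrt(D),
    'w2': jax.random.normal(ks[6], (4 * D, D)) / jnp.sqrt(4 * D),
}
x = jax.random.normal(ks[7], (T, N, D))
y = trecvit_block(x, params)
print(y.shape)  # (8, 16, 32)
```

Note how the factorisation yields the causality claimed in the abstract: only the time-mixing stage looks across frames, and its linear scan is strictly causal, so later frames never influence earlier outputs while attention stays confined to single frames.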
