TRecViT: A Recurrent Video Transformer

Abstract

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with a dedicated block for each dimension: gated linear recurrent units (LRUs) mix information over time, self-attention layers mix over space, and MLPs mix over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, whether trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with the pure-attention model ViViT-L on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOP count. Code and checkpoints will be made available online at https://github.com/google-deepmind/trecvit.
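
To make the time-space-channel factorisation concrete, the following is a minimal NumPy sketch of one such block. It is an illustration under stated assumptions, not the released implementation: the gate parameterisation, the single attention head, the ReLU MLP, the residual connections, the omission of normalisation layers, and every function and parameter name (`temporal_mix`, `spatial_mix`, `channel_mix`, `block`, `init_params`) are choices made here for brevity. The input is assumed to be a tensor of shape (T, N, D): T frames, N spatial tokens per frame, and D channels.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(dim, hidden, seed=0):
    """Random weights for the sketch (hypothetical names and shapes)."""
    rng = np.random.default_rng(seed)
    shapes = {
        "w_gate": (dim, dim),
        "w_q": (dim, dim), "w_k": (dim, dim), "w_v": (dim, dim),
        "w_1": (dim, hidden), "w_2": (hidden, dim),
    }
    return {name: rng.normal(0.0, dim ** -0.5, shape)
            for name, shape in shapes.items()}


def temporal_mix(x, w_gate):
    """Gated linear recurrence over time, run independently per spatial token.

    Assumed simplified form: h_t = a_t * h_{t-1} + (1 - a_t) * x_t,
    with an input-dependent gate a_t = sigmoid(x_t @ w_gate).
    """
    gate = sigmoid(x @ w_gate)            # (T, N, D)
    h = np.zeros_like(x[0])               # (N, D) recurrent state
    out = np.empty_like(x)
    for t in range(x.shape[0]):           # frame t only sees frames 0..t
        h = gate[t] * h + (1.0 - gate[t]) * x[t]
        out[t] = h
    return out


def spatial_mix(x, w_q, w_k, w_v):
    """Single-head self-attention among the N tokens of each frame."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                        # (T, N, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (T, N, N)
    return softmax(scores, axis=-1) @ v


def channel_mix(x, w_1, w_2):
    """Two-layer MLP applied to each token's channel vector."""
    return np.maximum(x @ w_1, 0.0) @ w_2


def block(x, p):
    """Time, then space, then channel mixing, each with a residual connection."""
    x = x + temporal_mix(x, p["w_gate"])
    x = x + spatial_mix(x, p["w_q"], p["w_k"], p["w_v"])
    x = x + channel_mix(x, p["w_1"], p["w_2"])
    return x


if __name__ == "__main__":
    params = init_params(dim=8, hidden=16)
    video = np.random.default_rng(1).normal(size=(6, 4, 8))  # (T, N, D)
    print(block(video, params).shape)
```

In this sketch, only the temporal step carries information across frames, and it does so strictly forward in time, which is what makes the block causal: perturbing later frames leaves the outputs for earlier frames unchanged. The recurrent state has a fixed size of (N, D) regardless of the number of frames.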
