Towards Universal Soccer Video Understanding

December 2, 2024
Authors: Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, Weidi Xie
cs.AI

Abstract

As a globally celebrated sport, soccer has attracted widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions in this paper: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, with an automated annotation pipeline; (ii) we present the first visual-language foundation model in the soccer domain, MatchVision, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on event classification, commentary generation, and multi-view foul recognition. MatchVision demonstrates state-of-the-art performance on all of them, substantially outperforming existing models, which highlights the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research.

December 6, 2024