Towards Universal Soccer Video Understanding

Abstract

As a globally celebrated sport, soccer attracts widespread interest from fans all over the world. This paper aims to develop a comprehensive multi-modal framework for soccer video understanding. Specifically, we make the following contributions: (i) we introduce SoccerReplay-1988, the largest multi-modal soccer dataset to date, featuring videos and detailed annotations from 1,988 complete matches, built with an automated annotation pipeline; (ii) we present MatchVision, the first visual-language foundation model in the soccer domain, which leverages spatiotemporal information across soccer videos and excels in various downstream tasks; (iii) we conduct extensive experiments and ablation studies on event classification, commentary generation, and multi-view foul recognition. MatchVision demonstrates state-of-the-art performance on all three tasks, substantially outperforming existing models, which highlights the superiority of our proposed data and model. We believe that this work will offer a standard paradigm for sports understanding research.
