EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
November 13, 2024
Authors: Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang
cs.AI
Abstract
Video generation has emerged as a promising tool for world simulation,
leveraging visual data to replicate real-world environments. Within this
context, egocentric video generation, which centers on the human perspective,
holds significant potential for enhancing applications in virtual reality,
augmented reality, and gaming. However, the generation of egocentric videos
presents substantial challenges due to the dynamic nature of egocentric
viewpoints, the intricate diversity of actions, and the complex variety of
scenes encountered. Existing datasets are inadequate for addressing these
challenges effectively. To bridge this gap, we present EgoVid-5M, the first
high-quality dataset specifically curated for egocentric video generation.
EgoVid-5M encompasses 5 million egocentric video clips and is enriched with
detailed action annotations, including fine-grained kinematic control and
high-level textual descriptions. To ensure the integrity and usability of the
dataset, we implement a sophisticated data cleaning pipeline designed to
maintain frame consistency, action coherence, and motion smoothness under
egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable
of generating egocentric videos driven simultaneously by action descriptions
and kinematic control signals. The EgoVid-5M dataset, associated action
annotations, and all data cleansing metadata will be released for the
advancement of research in egocentric video generation.