EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

November 13, 2024
Authors: Xiaofeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang
cs.AI

Abstract

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments. Within this context, egocentric video generation, which centers on the human perspective, holds significant potential for enhancing applications in virtual reality, augmented reality, and gaming. However, the generation of egocentric videos presents substantial challenges due to the dynamic nature of egocentric viewpoints, the intricate diversity of actions, and the complex variety of scenes encountered. Existing datasets are inadequate for addressing these challenges effectively. To bridge this gap, we present EgoVid-5M, the first high-quality dataset specifically curated for egocentric video generation. EgoVid-5M encompasses 5 million egocentric video clips and is enriched with detailed action annotations, including fine-grained kinematic control and high-level textual descriptions. To ensure the integrity and usability of the dataset, we implement a sophisticated data cleaning pipeline designed to maintain frame consistency, action coherence, and motion smoothness under egocentric conditions. Furthermore, we introduce EgoDreamer, which is capable of generating egocentric videos driven simultaneously by action descriptions and kinematic control signals. The EgoVid-5M dataset, associated action annotations, and all data cleansing metadata will be released for the advancement of research in egocentric video generation.
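
The abstract names three cleaning criteria (frame consistency, action coherence, motion smoothness) but does not specify how they are computed. Purely as an illustrative sketch, and not the paper's actual pipeline, the snippet below shows one plausible way such filters could be composed over precomputed features; the thresholds, feature inputs, and function names (`passes_cleaning`, `cosine_sim`) are all assumptions.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def passes_cleaning(frame_feats: np.ndarray,
                    flow_mags: np.ndarray,
                    text_feat: np.ndarray,
                    clip_feat: np.ndarray,
                    sim_thresh: float = 0.85,
                    flow_lo: float = 0.5,
                    flow_hi: float = 30.0,
                    align_thresh: float = 0.25) -> bool:
    """Return True if a clip survives all three hypothetical filters.

    frame_feats: (T, D) per-frame visual embeddings (e.g., from a CLIP-like encoder).
    flow_mags:   (T-1,) mean optical-flow magnitude between consecutive frames.
    text_feat:   (D,) embedding of the clip's action description.
    clip_feat:   (D,) pooled visual embedding of the whole clip.
    All thresholds are placeholder values, not from the paper.
    """
    # 1) Frame consistency: adjacent frames should look alike;
    #    a low similarity suggests a scene cut or corrupted frames.
    consec = [cosine_sim(frame_feats[i], frame_feats[i + 1])
              for i in range(len(frame_feats) - 1)]
    if min(consec) < sim_thresh:
        return False

    # 2) Motion smoothness: reject near-static clips and violent camera shake,
    #    both common failure modes for head-mounted egocentric footage.
    if flow_mags.mean() < flow_lo or flow_mags.max() > flow_hi:
        return False

    # 3) Action coherence: the visual content should match its action annotation.
    if cosine_sim(text_feat, clip_feat) < align_thresh:
        return False

    return True
```

Since the authors state that all data-cleansing metadata will be released with the dataset, a filter like this could in practice be replaced by simply thresholding the published per-clip scores.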
