Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

February 14, 2025
作者: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
cs.AI

Abstract
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep-compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. User prompts are encoded with two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating state-of-the-art text-to-video quality compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can also be accessed at https://yuewen.cn/videos. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
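
To make the stated compression ratios and training objective concrete, the following is a minimal sketch of a Flow Matching training step over VAE latents. It is not the authors' implementation: the latent channel count, input resolution, network, and variable names are illustrative assumptions; only the 16x16 spatial / 8x temporal compression ratios and the Flow Matching objective come from the abstract.

```python
# Illustrative sketch only -- not the Step-Video-T2V implementation.
# Shapes follow the compression ratios stated in the abstract: 16x16 spatial
# and 8x temporal, so a 204-frame clip maps to roughly 204/8 ~ 25 latent
# frames (the exact count depends on the VAE's temporal padding/causality).
import torch
import torch.nn as nn

B, C_lat = 2, 16          # batch size and latent channels (assumed values)
T, H, W = 204, 544, 992   # input clip: frames x height x width (assumed resolution)
t_lat, h_lat, w_lat = T // 8, H // 16, W // 16  # ~25 x 34 x 62 latent grid

# Stand-in for the DiT with 3D full attention; any velocity-prediction
# network with matching input/output shapes would fit this training loop.
class TinyVelocityNet(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # A real DiT would also condition on the text embeddings and timestep t.
        return self.net(x_t)

model = TinyVelocityNet(C_lat)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One Flow Matching training step on the linear-interpolation path:
# x_t = (1 - t) * noise + t * data, with target velocity v = data - noise.
x1 = torch.randn(B, C_lat, t_lat, h_lat, w_lat)  # stand-in for VAE latents of real videos
x0 = torch.randn_like(x1)                        # Gaussian noise
t = torch.rand(B).view(B, 1, 1, 1, 1)            # per-sample time in [0, 1]
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

opt.zero_grad()
loss = nn.functional.mse_loss(model(x_t, t.flatten()), v_target)
loss.backward()
opt.step()
```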
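
The abstract does not detail Video-DPO's objective. For orientation only, below is the generic DPO preference loss (Rafailov et al., 2023) that such approaches adapt, written here over per-video log-likelihoods; the function name, β value, and dummy inputs are illustrative, not taken from the report.

```python
# Generic DPO preference loss -- a reference sketch, not Video-DPO itself.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor,      # policy log-prob of preferred (winning) videos
             logp_l: torch.Tensor,      # policy log-prob of rejected (losing) videos
             ref_logp_w: torch.Tensor,  # frozen reference-model log-probs, same pairs
             ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: increase the policy's margin for the preferred sample
    relative to a frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Example with dummy log-probs for a batch of 4 preference pairs.
lw, ll = torch.randn(4), torch.randn(4)
loss = dpo_loss(lw, ll, lw.detach() - 0.1, ll.detach() + 0.1)
```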
