Improving Video Generation with Human Feedback

January 23, 2025
作者: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
cs.AI

Abstract

Video generation has achieved significant advances through rectified flow techniques, but issues such as unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multiple dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and that Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
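Flow-DPO adapts DPO-style preference optimization to flow-based generators: the loss compares the policy's and a frozen reference model's prediction errors on a preferred versus a dispreferred sample. A minimal sketch of such a preference loss, assuming velocity-prediction MSEs as the error terms and a hypothetical `beta` strength; the paper's exact loss (weighting, timestep handling) may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flow_dpo_loss(err_win_theta, err_win_ref, err_lose_theta, err_lose_ref, beta=0.1):
    """DPO-style preference loss sketch for flow matching.

    Each err_* argument is a velocity-prediction MSE: the trained policy
    (theta) or the frozen reference model evaluated on the preferred
    ("win") or dispreferred ("lose") video of a pair. The loss shrinks
    when the policy improves on the winner relative to the loser,
    beyond the reference model's own gap.
    """
    margin = (err_win_theta - err_win_ref) - (err_lose_theta - err_lose_ref)
    return -np.log(sigmoid(-beta * margin))
```

With a zero margin the loss reduces to `-log(0.5)`; a policy that fits the preferred sample better than the reference gap suggests receives a lower loss than one that favors the dispreferred sample.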
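The inference-time Flow-NRG technique lets users weight multiple reward dimensions into a single guidance signal. A minimal sketch of that weighted aggregation, with hypothetical dimension names and scores (not taken from the paper):

```python
def combined_reward(scores: dict, weights: dict) -> float:
    """Weighted combination of per-dimension reward scores.

    Flow-NRG-style multi-objective guidance lets a user trade off
    dimensions (e.g. visual quality vs. motion vs. prompt alignment)
    by assigning custom weights at inference time.
    """
    # Normalize weights to sum to 1, preserving the user's chosen ratios.
    total = sum(weights.values())
    return sum(weights[k] / total * scores[k] for k in scores)

# Hypothetical per-dimension scores from a multi-dimensional reward model.
scores = {"visual_quality": 0.8, "motion_quality": 0.5, "text_alignment": 0.9}
# A user who cares most about smooth motion up-weights that dimension.
weights = {"visual_quality": 1.0, "motion_quality": 2.0, "text_alignment": 1.0}
print(combined_reward(scores, weights))
```

In the actual method this scalar would steer the denoising trajectory of the noisy video; here it only illustrates how per-user weights reshape a multi-dimensional reward.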
