VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
December 30, 2024
Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
cs.AI
Abstract
We present a general strategy for aligning visual generation models -- both image and video generation -- with human preferences. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, which are linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance in video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.
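To make the scoring scheme concrete, below is a minimal sketch of how binary answers to judgment questions could be linearly weighted and summed into a single score. The dimension names, questions, and weights here are illustrative placeholders, not the authors' actual checklist or learned weights; those are provided in the repository at https://github.com/THUDM/VisionReward.

```python
# Sketch only: each preference dimension is probed by yes/no judgment
# questions; the binary answers are weighted and summed into one
# interpretable score. Weights and question names are hypothetical.
from typing import Dict

WEIGHTS: Dict[str, float] = {
    "alignment/matches_prompt": 1.0,
    "fidelity/free_of_artifacts": 0.8,
    "dynamics/motion_is_smooth": 0.6,  # video-specific dimension
}

def vision_reward_score(answers: Dict[str, bool]) -> float:
    """Linearly weight and sum binary judgment answers into a score."""
    return sum(WEIGHTS[q] * float(a) for q, a in answers.items() if q in WEIGHTS)

# Example: a generation that passes two of the three illustrative checks.
print(vision_reward_score({
    "alignment/matches_prompt": True,
    "fidelity/free_of_artifacts": True,
    "dynamics/motion_is_smooth": False,
}))  # -> 1.8
```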