VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

December 30, 2024
Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
cs.AI

Abstract

We present a general strategy for aligning visual generation models -- both image and video generation -- with human preferences. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences for images and videos into multiple dimensions, each represented by a series of judgment questions, which are linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance in video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors in preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.
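The abstract describes an interpretable scoring scheme: preferences are decomposed into dimensions, each probed by yes/no judgment questions whose answers are linearly weighted and summed. The minimal Python sketch below illustrates only that aggregation step; the question identifiers, the yes/no-to-sign encoding, and the weights are illustrative assumptions, not the released checklist or the learned weights.

```python
# Illustrative sketch of VisionReward-style linear aggregation of judgment
# answers. Questions and weights here are placeholders, not the paper's.

from typing import Dict

def vision_reward_score(answers: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Combine binary judgment answers into one interpretable score.

    answers: question id -> True ("yes") / False ("no"), e.g. produced by a
             vision-language model asked each checklist question.
    weights: question id -> linear weight fitted on preference data.
    """
    # Map yes/no to +1/-1 so each question contributes signed evidence, then
    # take the weighted sum. (The paper's exact encoding may differ; this
    # only demonstrates the "linearly weighted and summed" aggregation.)
    return sum(weights[q] * (1.0 if ans else -1.0) for q, ans in answers.items())

# Hypothetical usage with two dimensions ("alignment", "stability"):
answers = {"alignment/objects_match_prompt": True, "stability/no_flicker": False}
weights = {"alignment/objects_match_prompt": 0.8, "stability/no_flicker": 0.5}
print(vision_reward_score(answers, weights))  # 0.8 * 1 + 0.5 * (-1) = 0.3
```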
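The abstract also states that the multi-objective preference learning algorithm addresses confounding factors in preference data. One natural way to remove such confounding is to keep only pairs in which the chosen sample dominates the rejected one on every dimension, so that no dimension can be traded off against another. The sketch below assumes that dominance-filtering criterion for illustration; it is not a statement of the paper's exact algorithm.

```python
# Hedged sketch of multi-objective preference-pair filtering under an assumed
# dominance criterion: keep (chosen, rejected) only when the chosen sample is
# at least as good on all dimensions and strictly better on at least one.

from typing import List, Sequence, Tuple

Vec = Sequence[float]  # per-dimension scores, e.g. from a multi-dimensional reward model

def dominates(a: Vec, b: Vec) -> bool:
    """True if `a` >= `b` on every dimension and `a` > `b` on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def filter_pairs(pairs: List[Tuple[Vec, Vec]]) -> List[Tuple[Vec, Vec]]:
    """Keep only unambiguous pairs where the chosen sample dominates."""
    return [(c, r) for c, r in pairs if dominates(c, r)]

# Hypothetical usage: each pair holds (chosen_scores, rejected_scores).
pairs = [((0.9, 0.7), (0.5, 0.6)),   # better on both dimensions -> kept
         ((0.9, 0.4), (0.5, 0.6))]   # better on one, worse on another -> dropped as confounded
print(filter_pairs(pairs))
```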
