VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
December 30, 2024
Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
cs.AI
Abstract
We present a general strategy for aligning visual generation models -- both
image and video generation -- with human preference. To start with, we build
VisionReward -- a fine-grained and multi-dimensional reward model. We decompose
human preferences in images and videos into multiple dimensions, each
represented by a series of judgment questions whose answers are linearly
weighted and summed into an interpretable and accurate score. To address the
challenges of video quality
assessment, we systematically analyze various dynamic features of videos, which
helps VisionReward surpass VideoScore by 17.2% and achieve top performance for
video preference prediction. Based on VisionReward, we develop a
multi-objective preference learning algorithm that effectively addresses the
issue of confounding factors within preference data. Our approach significantly
outperforms existing image and video scoring methods on both machine metrics
and human evaluation. All code and datasets are provided at
https://github.com/THUDM/VisionReward.
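
The scoring scheme described in the abstract (judgment questions whose answers are linearly weighted and summed) can be illustrated with a minimal sketch. This is not the authors' implementation: the question names, the weights, and the helper functions `reward_score`, `contributions`, and `prefer` are all hypothetical, chosen only to show how a weighted checklist yields a score that can be read off term by term.

```python
from typing import Dict

# Hypothetical judgment questions and weights (illustrative assumptions only;
# the abstract does not list the real questions or how weights are obtained).
WEIGHTS: Dict[str, float] = {
    "is_subject_clear": 1.0,
    "is_lighting_natural": 0.5,
    "has_visible_artifacts": -1.5,  # a "yes" here lowers the score
}


def contributions(answers: Dict[str, bool],
                  weights: Dict[str, float] = WEIGHTS) -> Dict[str, float]:
    """Per-question contribution: the weight if the answer is yes, else 0."""
    return {q: (w if answers[q] else 0.0) for q, w in weights.items()}


def reward_score(answers: Dict[str, bool],
                 weights: Dict[str, float] = WEIGHTS) -> float:
    """Linear weighted sum of the binary answers."""
    return sum(contributions(answers, weights).values())


def prefer(a: Dict[str, bool], b: Dict[str, bool]) -> str:
    """Predict which of two samples is preferred by comparing scores."""
    score_a, score_b = reward_score(a), reward_score(b)
    if score_a == score_b:
        return "tie"
    return "a" if score_a > score_b else "b"


sample_a = {"is_subject_clear": True, "is_lighting_natural": True,
            "has_visible_artifacts": False}
sample_b = {"is_subject_clear": True, "is_lighting_natural": False,
            "has_visible_artifacts": True}

print(reward_score(sample_a), reward_score(sample_b), prefer(sample_a, sample_b))
```

Because the score is a plain sum, `contributions` shows exactly which questions pushed a sample up or down, which is the sense in which such a score is interpretable.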