
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

December 30, 2024
Authors: Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, Jiayan Teng, Zhuoyi Yang, Wendi Zheng, Xiao Liu, Ming Ding, Xiaohan Zhang, Xiaotao Gu, Shiyu Huang, Minlie Huang, Jie Tang, Yuxiao Dong
cs.AI

Abstract

We present a general strategy for aligning visual generation models -- both image and video generation -- with human preference. To start with, we build VisionReward -- a fine-grained and multi-dimensional reward model. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions, linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance for video preference prediction. Based on VisionReward, we develop a multi-objective preference learning algorithm that effectively addresses the issue of confounding factors within preference data. Our approach significantly outperforms existing image and video scoring methods on both machine metrics and human evaluation. All code and datasets are provided at https://github.com/THUDM/VisionReward.
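
The abstract describes scoring a generated image or video by answering a set of binary judgment questions per preference dimension and combining the answers with a linear weighting. The following is a minimal sketch of that scoring scheme, not the official VisionReward implementation; the dimension names, question weights, and the function `vision_reward_score` are hypothetical placeholders for illustration.

```python
# Minimal sketch (assumed, not the official implementation) of the scoring
# scheme described in the abstract: each preference dimension is probed with
# binary judgment questions whose answers are linearly weighted and summed
# into a single interpretable score.
from typing import Dict, List

# Hypothetical dimensions and per-question weights (illustrative values only).
WEIGHTS: Dict[str, List[float]] = {
    "composition": [0.8, 0.5],  # e.g. "Is the subject well framed?", "Is the layout balanced?"
    "fidelity":    [1.2, 0.9],  # e.g. "Is the image free of artifacts?", "Is the anatomy correct?"
    "dynamics":    [1.0],       # e.g. "Is the motion smooth?" (video only)
}

def vision_reward_score(answers: Dict[str, List[bool]]) -> float:
    """Linearly combine yes/no judgments into one scalar score."""
    score = 0.0
    for dimension, weights in WEIGHTS.items():
        for weight, answer in zip(weights, answers.get(dimension, [])):
            score += weight * (1.0 if answer else 0.0)
    return score

# Example: judgments produced by a vision-language model for one generated video.
example_answers = {
    "composition": [True, True],
    "fidelity": [True, False],
    "dynamics": [True],
}
print(vision_reward_score(example_answers))  # 0.8 + 0.5 + 1.2 + 1.0 = 3.5
```

Because the final score is a weighted sum of individual judgments, each dimension's contribution can be inspected separately, which is what makes the score interpretable.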
