LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

December 6, 2024
Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li
cs.AI

Abstract

Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method that leverages human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to learn the reward function effectively; it serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.
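
To make the final alignment step concrete, the sketch below shows one way reward-weighted likelihood maximization can be wired into a fine-tuning loop for a diffusion-based T2V model. It is a minimal illustration, not the authors' released code: `t2v_model.per_sample_loss` and `critic_score` are hypothetical stand-ins for the model's per-sample denoising loss and for a LiFT-Critic-style scalar reward, and the softmax normalization of rewards is just one possible weighting choice.

```python
# Minimal sketch of reward-weighted likelihood fine-tuning (assumptions noted above).
import torch

def reward_weighted_step(t2v_model, critic_score, optimizer, prompts, videos):
    # Score each video/prompt pair with the reward model (proxy for human judgment).
    with torch.no_grad():
        rewards = critic_score(videos, prompts)        # hypothetical; shape: (batch,)
        # One possible weighting scheme: normalize rewards across the batch.
        weights = torch.softmax(rewards, dim=0)

    # Per-sample negative log-likelihood surrogate (e.g., the denoising loss).
    nll = t2v_model.per_sample_loss(videos, prompts)   # hypothetical; shape: (batch,)

    # Maximizing the reward-weighted likelihood == minimizing the reward-weighted NLL.
    loss = (weights * nll).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Under this scheme, samples the reward model scores highly contribute more to the gradient, pulling the T2V model toward outputs that better match human preferences.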
