

LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

December 6, 2024
Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li
cs.AI

Abstract

Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a novel fine-tuning method that leverages human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each including a score and its corresponding rationale. Based on this, we train a reward model, LiFT-Critic, to effectively learn a reward function, which serves as a proxy for human judgment and measures the alignment between given videos and human expectations. Lastly, we leverage the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback in improving the alignment and quality of synthesized videos.
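As a rough illustration of the reward-weighted likelihood objective mentioned in the abstract (a sketch based on our reading; the symbols c, x, r_phi, and p_theta are our own notation, not taken from the paper), the fine-tuning step can be written as weighting the standard likelihood term for each generated video by the critic's reward:

$$
\max_{\theta}\; \mathbb{E}_{(c,\, x)}\big[\, r_{\phi}(x, c)\, \log p_{\theta}(x \mid c) \,\big]
$$

where c is a text prompt, x a video synthesized for that prompt, r_phi the learned LiFT-Critic reward model, and p_theta the T2V generator being fine-tuned, so that samples judged closer to human expectations contribute more strongly to the update.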

