MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
February 14, 2025
Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan
cs.AI
Abstract
Despite notable advancements in Multimodal Large Language Models (MLLMs),
most state-of-the-art models have not undergone thorough alignment with human
preferences. This gap exists because current alignment research has primarily
achieved progress in specific areas (e.g., hallucination reduction), while the
broader question of whether aligning models with human preferences can
systematically enhance MLLM capability remains largely unexplored. To this end,
we introduce MM-RLHF, a dataset containing 120k fine-grained,
human-annotated preference comparison pairs. This dataset represents a
substantial advancement over existing resources, offering superior size,
diversity, annotation granularity, and quality. Leveraging this dataset, we
propose several key innovations to improve both the quality of reward models
and the efficiency of alignment algorithms. Notably, we introduce a
Critique-Based Reward Model, which generates critiques of model outputs before
assigning scores, offering enhanced interpretability and more informative
feedback compared to traditional scalar reward mechanisms. Additionally, we
propose Dynamic Reward Scaling, a method that adjusts the loss weight of each
sample according to the reward signal, thereby optimizing the use of
high-quality comparison pairs. Our approach is rigorously evaluated across
10 distinct dimensions and 27 benchmarks, with results
demonstrating significant and consistent improvements in model performance.
Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm
leads to a 19.5% increase in conversational abilities and a
60% improvement in safety.
We have open-sourced the preference dataset, reward model, training and
evaluation code, as well as reward modeling and safety benchmarks. For more
details, please visit our project page: https://mm-rlhf.github.io.
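The abstract only sketches how Dynamic Reward Scaling works, so the snippet below gives a minimal, hypothetical illustration of the idea: a DPO-style preference loss whose per-pair weight grows with the reward model's score margin. The DPO base loss, the sigmoid weighting, and the parameters `beta` and `scale_gamma` are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch of Dynamic Reward Scaling: each preference pair's loss is
# weighted by a signal from the reward model, so high-quality comparison
# pairs contribute more to training. The DPO-style loss and the sigmoid
# weighting of the reward margin are illustrative assumptions.
import torch
import torch.nn.functional as F

def dynamic_reward_scaled_loss(
    policy_chosen_logps: torch.Tensor,   # log p_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor, # log p_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,      # reference-model log-probs, shape (B,)
    ref_rejected_logps: torch.Tensor,
    reward_margin: torch.Tensor,         # reward-model score gap r(y_w) - r(y_l), shape (B,)
    beta: float = 0.1,                   # DPO temperature (assumed)
    scale_gamma: float = 1.0,            # steepness of the weighting (assumed)
) -> torch.Tensor:
    # Standard DPO logits: implicit reward difference between chosen and rejected.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-sample preference loss before any weighting.
    per_sample_loss = -F.logsigmoid(logits)

    # Dynamic Reward Scaling: pairs with a larger, more confident reward margin
    # receive a higher loss weight; the weights are detached so the scaling
    # does not backpropagate through the reward model.
    weights = torch.sigmoid(scale_gamma * reward_margin).detach()

    return (weights * per_sample_loss).mean()
```

In this sketch, pairs the reward model is more confident about contribute more to the gradient, which matches the stated goal of prioritizing high-quality comparison pairs during alignment.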