InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
January 21, 2025
Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
cs.AI
Abstract
Despite the promising performance of Large Vision Language Models (LVLMs) in
visual understanding, they occasionally generate incorrect outputs. While
reward models (RMs) with reinforcement learning or test-time scaling offer the
potential for improving generation quality, a critical gap remains: publicly
available multi-modal RMs for LVLMs are scarce, and the implementation details
of proprietary models are often unclear. We bridge this gap with
InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective
multi-modal reward model that aligns LVLMs with human preferences. To ensure
the robustness and versatility of IXC-2.5-Reward, we set up a high-quality
multi-modal preference corpus spanning text, image, and video inputs across
diverse domains, such as instruction following, general understanding,
text-rich documents, mathematical reasoning, and video understanding.
IXC-2.5-Reward achieves excellent results on the latest multi-modal reward
model benchmark and shows competitive performance on text-only reward model
benchmarks. We further demonstrate three key applications of IXC-2.5-Reward:
(1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward
with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows
consistent improvements in instruction following and multi-modal open-ended
dialogue; (2) Selecting the best response from candidate responses for
test-time scaling; and (3) Filtering outlier or noisy samples from existing
image and video instruction tuning training data. To ensure reproducibility and
facilitate further research, we have open-sourced all model weights and
training recipes at https://github.com/InternLM/InternLM-XComposer.
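To illustrate applications (2) and (3), the following is a minimal sketch of best-of-N response selection and reward-based data filtering around a generic scoring function. The abstract does not specify the IXC-2.5-Reward inference API, so `score_response`, the sample field names, and the filtering threshold are hypothetical placeholders, not the released interface.

```python
# Sketch of reward-model-based test-time selection and data filtering.
# `score_response(question, answer) -> float` is an assumed stand-in for
# whatever scoring call the released IXC-2.5-Reward checkpoint exposes.
from typing import Callable, List, Dict

def select_best_response(
    question: str,
    candidates: List[str],
    score_response: Callable[[str, str], float],
) -> str:
    """Application (2): return the candidate with the highest scalar reward."""
    scores = [score_response(question, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]

def filter_noisy_samples(
    samples: List[Dict[str, str]],
    score_response: Callable[[str, str], float],
    threshold: float = 0.0,
) -> List[Dict[str, str]]:
    """Application (3): drop instruction-tuning samples whose reward falls
    below a threshold. The threshold value here is purely illustrative."""
    return [
        s for s in samples
        if score_response(s["question"], s["answer"]) >= threshold
    ]
```

In both sketches the reward model is used only as a black-box scorer over (question, response) pairs; the PPO integration in application (1) additionally feeds these scores back as the training signal for the policy model.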