InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Abstract

Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) combined with reinforcement learning or test-time scaling offer the potential to improve generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we construct a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training, where integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer.
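Applications (2) and (3) both reduce to scoring responses with a reward model and then either taking the maximum or applying a cutoff. The following is a minimal sketch of that pattern, not the released implementation: `best_of_n`, `filter_by_reward`, and the threshold value are hypothetical names chosen for illustration, and `toy_reward` is a trivial word-overlap stand-in for a real reward model such as IXC-2.5-Reward, whose actual scoring interface is not described here.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# A reward function maps (prompt, response) to a scalar score.
# In practice this would wrap a trained reward model; here it is an assumption.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: Sequence[str], reward_fn: RewardFn) -> Tuple[str, float]:
    """Test-time scaling: return the highest-scoring candidate and its score."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def filter_by_reward(
    samples: Iterable[Tuple[str, str]], reward_fn: RewardFn, threshold: float
) -> List[Tuple[str, str]]:
    """Data cleaning: keep (prompt, response) pairs scoring at least `threshold`."""
    return [(p, r) for p, r in samples if reward_fn(p, r) >= threshold]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in scorer: fraction of prompt words found in the response."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) / len(prompt_words)


if __name__ == "__main__":
    prompt = "describe the red car"
    candidates = ["a red car", "the red car is parked", "no idea"]
    print(best_of_n(prompt, candidates, toy_reward))
    print(filter_by_reward([(prompt, c) for c in candidates], toy_reward, threshold=0.5))
```

Swapping `toy_reward` for a function that calls a real multi-modal reward model would leave the selection and filtering logic unchanged; only the scorer and a sensible threshold would need to be chosen for the data at hand.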
