VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
November 26, 2024
Authors: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
cs.AI
Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in
aligning and evaluating multimodal AI systems, yet their own evaluation remains
under-explored. Current assessment methods primarily rely on AI-annotated
preference labels from traditional VL tasks, which can introduce biases and
often fail to effectively challenge state-of-the-art models. To address these
limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning
general multimodal queries, visual hallucination detection, and complex
reasoning tasks. Through our AI-assisted annotation pipeline combining sample
selection with human verification, we curate 1,250 high-quality examples
specifically designed to probe model limitations. Comprehensive evaluation
across 16 leading large vision-language models demonstrates VL-RewardBench's
effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4%
accuracy and state-of-the-art open-source models such as Qwen2-VL-72B
struggle to surpass random guessing. Importantly, performance on VL-RewardBench
strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N
sampling with VL-GenRMs. Analysis experiments uncover three critical insights
for improving VL-GenRMs: (i) models predominantly fail at basic visual
perception tasks rather than reasoning tasks; (ii) inference-time scaling
benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to
learn to judge substantially boosts judgment capability (+14.7% accuracy for a
7B VL-GenRM). We believe VL-RewardBench, along with the experimental insights,
will become a valuable resource for advancing VL-GenRMs.
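To illustrate the evaluation setting mentioned in the abstract, below is a minimal Python sketch of Best-of-N sampling with a VL-GenRM used as a pairwise judge. The helpers `generate_candidates` and `judge_prefers_first` are hypothetical placeholders for a policy vision-language model and a judge prompt, and the pairwise-tournament reduction shown here is only one simple way to pick a winner; the paper's exact setup may differ.

```python
# Minimal sketch (not the authors' code) of Best-of-N sampling where a
# vision-language generative reward model (VL-GenRM) acts as a pairwise judge.
# `generate_candidates` and `judge_prefers_first` are assumed, hypothetical
# interfaces to a candidate-generating VLM and a judging VL-GenRM.

from typing import Callable, List


def best_of_n(
    image,                      # image for the multimodal query
    question: str,              # the user query
    generate_candidates: Callable[..., List[str]],
    judge_prefers_first: Callable[..., bool],
    n: int = 8,
) -> str:
    """Return the candidate response that survives pairwise judging."""
    candidates = generate_candidates(image, question, num_samples=n)
    best = candidates[0]
    for challenger in candidates[1:]:
        # Ask the VL-GenRM which response better answers the question,
        # keeping the winner as the running best.
        if not judge_prefers_first(image, question, best, challenger):
            best = challenger
    return best
```

Under this reading, a stronger judge selects better candidates more often, which is consistent with the reported correlation between VL-RewardBench accuracy and downstream MMMU-Pro accuracy under Best-of-N sampling.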