VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
November 26, 2024
作者: Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
cs.AI
Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in
aligning and evaluating multimodal AI systems, yet their own evaluation remains
under-explored. Current assessment methods primarily rely on AI-annotated
preference labels from traditional VL tasks, which can introduce biases and
often fail to effectively challenge state-of-the-art models. To address these
limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning
general multimodal queries, visual hallucination detection, and complex
reasoning tasks. Through our AI-assisted annotation pipeline combining sample
selection with human verification, we curate 1,250 high-quality examples
specifically designed to probe model limitations. Comprehensive evaluation
across 16 leading large vision-language models demonstrates VL-RewardBench's
effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4%
accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B
struggle to surpass random guessing. Importantly, performance on VL-RewardBench
strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N
sampling with VL-GenRMs. Analysis experiments uncover three critical insights
for improving VL-GenRMs: (i) models predominantly fail at basic visual
perception tasks rather than reasoning tasks; (ii) inference-time scaling
benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to
learn to judge substantially boosts judgment capability (+14.7% accuracy for a
7B VL-GenRM). We believe VL-RewardBench along with the experimental insights
will become a valuable resource for advancing VL-GenRMs.
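
For readers who want a concrete picture of how a VL-GenRM is used in the two settings the abstract mentions, below is a minimal Python sketch of (a) scoring judge accuracy on curated response pairs and (b) Best-of-N answer selection. The helper `score_with_vl_genrm` (here the `Scorer` callable), the field names `image`/`query`/`chosen`/`rejected`, and the return conventions are illustrative assumptions, not the released VL-RewardBench code or schema.

```python
# Illustrative sketch only: the `Scorer` callable stands in for whatever
# VL-GenRM you query (API or local model); it is assumed to return a scalar
# quality score for an (image, query, response) triple.

from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[bytes, str, str], float]


def pairwise_judge_accuracy(examples: Sequence[Dict], score: Scorer) -> float:
    """Fraction of curated pairs where the judge scores the human-verified
    'chosen' response above the 'rejected' one. Field names are assumed
    for illustration, not taken from the official dataset schema."""
    correct = 0
    for ex in examples:
        s_chosen = score(ex["image"], ex["query"], ex["chosen"])
        s_rejected = score(ex["image"], ex["query"], ex["rejected"])
        correct += int(s_chosen > s_rejected)
    return correct / len(examples)


def best_of_n(
    image: bytes,
    query: str,
    candidates: List[str],
    score: Scorer,
) -> Tuple[str, float]:
    """Score every candidate answer with the VL-GenRM and keep the best one,
    as in Best-of-N sampling for a downstream task such as MMMU-Pro."""
    scored = [(ans, score(image, query, ans)) for ans in candidates]
    return max(scored, key=lambda pair: pair[1])
```

Under this setup, a judge that cannot distinguish the chosen from the rejected response lands near 50% pairwise accuracy, which is the random-guessing baseline the abstract refers to; plugging the same scorer into `best_of_n` is what ties judge accuracy to the reported correlation with MMMU-Pro results.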