ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
April 10, 2025
Authors: Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
cs.AI
Abstract
Color plays an important role in human perception and usually provides
critical clues in visual reasoning. However, it is unclear whether and how
vision-language models (VLMs) can perceive, understand, and leverage color as
humans do. This paper introduces ColorBench, an innovative benchmark meticulously
crafted to assess the capabilities of VLMs in color understanding, including
color perception, reasoning, and robustness. By curating a suite of diverse
test scenarios, with grounding in real applications, ColorBench evaluates how
these models perceive colors, infer meanings from color-based cues, and
maintain consistent performance under varying color transformations. Through an
extensive evaluation of 32 VLMs with varying language models and vision
encoders, our paper reveals several previously unreported findings: (i) The scaling law
(larger models are better) still holds on ColorBench, while the language model
plays a more important role than the vision encoder. (ii) However, the
performance gaps across models are relatively small, indicating that color
understanding has been largely neglected by existing VLMs. (iii) Chain-of-thought
(CoT) reasoning improves color understanding accuracy and robustness, even
though these are vision-centric tasks. (iv) Color cues are indeed leveraged by
VLMs on ColorBench, but they can also mislead models in some tasks. These findings
highlight the critical limitations of current VLMs and underscore the need to
enhance color comprehension. ColorBench can serve as a foundational tool for
advancing the study of human-level color understanding in multimodal AI.