Are Vision-Language Models Truly Understanding Multi-vision Sensor?
December 30, 2024
Authors: Sangyun Chung, Youngjoon Yu, Youngchae Chee, Se Yeon Kim, Byung-Kwan Lee, Yong Man Ro
cs.AI
Abstract
Large-scale Vision-Language Models (VLMs) have advanced by aligning vision
inputs with text, significantly improving performance in computer vision tasks.
Moreover, for VLMs to be effectively utilized in real-world applications, an
understanding of diverse multi-vision sensor data, such as thermal, depth, and
X-ray information, is essential. However, we find that current VLMs process
multi-vision sensor images without deep understanding of sensor information,
disregarding each sensor's unique physical properties. This limitation
restricts their capacity to interpret and respond to complex questions
requiring multi-vision sensor reasoning. To address this, we propose a novel
Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs
on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse
Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning
on multi-vision sensor tasks, helping to bridge the core information gap
between images and sensor data. Extensive experimental results validate that
the proposed DNA method can significantly improve the multi-vision sensor
reasoning capability of VLMs.
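The abstract does not spell out how Diverse Negative Attributes (DNA) optimization is implemented. The sketch below is one plausible reading, assuming DNA contrasts a sensor-grounded positive answer against several negative answers that each violate a different physical property of the sensor (e.g., reading a thermal image as ordinary RGB), scored by the VLM's answer log-probabilities. The function name `dna_preference_loss`, the temperature `beta`, and the listwise softmax form are illustrative assumptions, not the paper's published formulation.

```python
import torch
import torch.nn.functional as F

def dna_preference_loss(pos_logp: torch.Tensor,
                        neg_logps: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Hypothetical listwise preference loss over one positive answer
    and several diverse negative answers.

    pos_logp:  (batch,) summed log-probability the VLM assigns to the
               sensor-grounded answer.
    neg_logps: (batch, num_neg) summed log-probabilities of answers that
               each violate a different sensor attribute, e.g. treating
               a thermal image as RGB or inverting depth ordering.
    """
    # Place the positive answer's score in column 0, then scale by beta.
    scores = beta * torch.cat([pos_logp.unsqueeze(1), neg_logps], dim=1)
    # Cross-entropy toward index 0 pushes the positive answer above all
    # negatives at once, rather than one pairwise comparison at a time.
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, labels)
```

In use, `pos_logp` and `neg_logps` would come from scoring candidate answers with the VLM being fine-tuned, and the loss would be backpropagated through those log-probabilities; a listwise form is chosen here because the abstract emphasizes the *diversity* of negatives, which a single pairwise comparison would not exploit.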