Are Vision-Language Models Truly Understanding Multi-vision Sensor?
December 30, 2024
Authors: Sangyun Chung, Youngjoon Yu, Youngchae Chee, Se Yeon Kim, Byung-Kwan Lee, Yong Man Ro
cs.AI
Abstract
Large-scale Vision-Language Models (VLMs) have advanced by aligning vision
inputs with text, significantly improving performance in computer vision tasks.
Moreover, for VLMs to be effectively utilized in real-world applications, an
understanding of diverse multi-vision sensor data, such as thermal, depth, and
X-ray information, is essential. However, we find that current VLMs process
multi-vision sensor images without a deep understanding of sensor information,
disregarding each sensor's unique physical properties. This limitation
restricts their capacity to interpret and respond to complex questions
requiring multi-vision sensor reasoning. To address this, we propose a novel
Multi-vision Sensor Perception and Reasoning (MS-PR) benchmark, assessing VLMs
on their capacity for sensor-specific reasoning. Moreover, we introduce Diverse
Negative Attributes (DNA) optimization to enable VLMs to perform deep reasoning
on multi-vision sensor tasks, helping to bridge the core information gap
between images and sensor data. Extensive experimental results validate that
the proposed DNA method can significantly improve multi-vision sensor
reasoning in VLMs.
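
The abstract does not spell out the DNA objective, but a natural reading is a preference-optimization loss in which the correct, sensor-aware answer is contrasted against multiple diverse negative answers that ignore the sensor's physical properties. The sketch below is a minimal, hypothetical DPO-style formulation under that assumption; the function name, tensor shapes, and the `beta` temperature are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dna_preference_loss(pos_logp: torch.Tensor,
                        neg_logps: torch.Tensor,
                        ref_pos_logp: torch.Tensor,
                        ref_neg_logps: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO-style preference loss over K diverse negatives (illustrative only).

    pos_logp:      (B,)   policy log-prob of the sensor-aware positive answer
    neg_logps:     (B, K) policy log-probs of K diverse negative answers
    ref_pos_logp:  (B,)   same positive log-prob under a frozen reference model
    ref_neg_logps: (B, K) same negative log-probs under the reference model
    """
    pos_margin = beta * (pos_logp - ref_pos_logp)    # (B,)
    neg_margin = beta * (neg_logps - ref_neg_logps)  # (B, K)
    # Prefer the positive answer over every diverse negative-attribute answer.
    logits = pos_margin.unsqueeze(1) - neg_margin    # (B, K)
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities (batch of 2, K = 4 negatives).
if __name__ == "__main__":
    B, K = 2, 4
    loss = dna_preference_loss(
        pos_logp=torch.randn(B),
        neg_logps=torch.randn(B, K),
        ref_pos_logp=torch.randn(B),
        ref_neg_logps=torch.randn(B, K),
    )
    print(loss.item())
```

Averaging the pairwise margins over all K negatives is one plausible way to use a *diverse* set of negatives; the actual DNA objective may weight or sample negatives differently.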