Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
January 7, 2025
作者: Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan
cs.AI
Abstract
Recent advancements in Vision-Language Models (VLMs) have sparked interest in
their use for autonomous driving, particularly in generating interpretable
driving decisions through natural language. However, the assumption that VLMs
inherently provide visually grounded, reliable, and interpretable explanations
for driving remains largely unexamined. To address this gap, we introduce
DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17
settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames,
20,498 question-answer pairs, three question types, four mainstream driving
tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often
generate plausible responses derived from general knowledge or textual cues
rather than true visual grounding, especially under degraded or missing visual
inputs. This behavior, concealed by dataset imbalances and insufficient
evaluation metrics, poses significant risks in safety-critical scenarios like
autonomous driving. We further observe that VLMs struggle with multi-modal
reasoning and display heightened sensitivity to input corruptions, leading to
inconsistencies in performance. To address these challenges, we propose refined
evaluation metrics that prioritize robust visual grounding and multi-modal
understanding. Additionally, we highlight the potential of leveraging VLMs'
awareness of corruptions to enhance their reliability, offering a roadmap for
developing more trustworthy and interpretable decision-making systems in
real-world autonomous driving contexts. The benchmark toolkit is publicly
accessible.
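For intuition, the following is a minimal sketch, in Python, of the kind of three-setting reliability probe the abstract describes: querying the same VLM on clean, corrupted, and text-only versions of each driving QA sample and comparing accuracies. The `Sample` fields, the Gaussian-noise `corrupt` function, and the `vlm.answer` interface are illustrative assumptions, not the benchmark's actual API.

```python
# Sketch of a three-setting reliability probe (clean / corrupted / text-only).
# NOTE: `vlm.answer`, the Sample fields, and the noise corruption are
# hypothetical placeholders, not DriveBench's actual interface.
from dataclasses import dataclass

import numpy as np


@dataclass
class Sample:
    image: np.ndarray   # camera frame as an HxWx3 uint8 array
    question: str
    reference: str      # ground-truth answer


def corrupt(image: np.ndarray, severity: float = 25.0) -> np.ndarray:
    """One illustrative visual corruption: additive Gaussian noise."""
    noisy = image.astype(np.float32) + np.random.normal(0.0, severity, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def probe_reliability(vlm, samples: list[Sample]) -> dict[str, float]:
    """Measure accuracy of the same model under three input settings."""
    hits = {"clean": 0, "corrupted": 0, "text_only": 0}
    for s in samples:
        answers = {
            "clean": vlm.answer(s.image, s.question),
            "corrupted": vlm.answer(corrupt(s.image), s.question),
            "text_only": vlm.answer(None, s.question),  # visual input withheld
        }
        for setting, ans in answers.items():
            hits[setting] += int(ans.strip() == s.reference)
    # If text-only accuracy approaches clean accuracy, responses are likely
    # driven by general knowledge or textual cues rather than visual grounding.
    return {k: v / len(samples) for k, v in hits.items()}
```

In this sketch, a large gap between clean and corrupted accuracy indicates sensitivity to input corruptions, while a small gap between clean and text-only accuracy is the warning sign the paper highlights: plausible answers produced without genuine visual grounding.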