Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives
January 7, 2025
Authors: Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, Liang Pan
cs.AI
Abstract
Recent advancements in Vision-Language Models (VLMs) have sparked interest in
their use for autonomous driving, particularly in generating interpretable
driving decisions through natural language. However, the assumption that VLMs
inherently provide visually grounded, reliable, and interpretable explanations
for driving remains largely unexamined. To address this gap, we introduce
DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17
settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames,
20,498 question-answer pairs, three question types, four mainstream driving
tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often
generate plausible responses derived from general knowledge or textual cues
rather than true visual grounding, especially under degraded or missing visual
inputs. This behavior, concealed by dataset imbalances and insufficient
evaluation metrics, poses significant risks in safety-critical scenarios like
autonomous driving. We further observe that VLMs struggle with multi-modal
reasoning and display heightened sensitivity to input corruptions, leading to
inconsistencies in performance. To address these challenges, we propose refined
evaluation metrics that prioritize robust visual grounding and multi-modal
understanding. Additionally, we highlight the potential of leveraging VLMs'
awareness of corruptions to enhance their reliability, offering a roadmap for
developing more trustworthy and interpretable decision-making systems in
real-world autonomous driving contexts. The benchmark toolkit is publicly
accessible.
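To make the evaluation protocol concrete, the sketch below illustrates one way to probe whether a VLM's answers actually depend on the visual input, following the paper's clean / corrupted / text-only comparison. This is a minimal, hypothetical sketch rather than the official DriveBench toolkit: the `vlm_answer`, `corrupt`, and `score` callables, and the corruption names, are assumed placeholders.

```python
# Hypothetical sketch of a visual-grounding probe inspired by DriveBench's
# clean / corrupted / text-only settings. All callables are supplied by the
# user; nothing here reflects the official toolkit's API.

from typing import Callable, Optional

# Illustrative subset of corruption types (placeholder names).
CORRUPTIONS = ["camera_crash", "frame_lost", "motion_blur", "fog"]


def probe_visual_grounding(
    vlm_answer: Callable[[Optional[str], str], str],  # (image_path or None, question) -> answer text
    image_path: str,
    question: str,
    corrupt: Callable[[str, str], str],   # (image_path, corruption_name) -> corrupted image path
    score: Callable[[str, str], float],   # (answer_a, answer_b) -> similarity in [0, 1]
) -> dict:
    """Compare one VLM's answers across clean, corrupted, and text-only inputs.

    A high similarity between the clean-input answer and the text-only answer
    suggests the model is answering from general knowledge or textual cues
    rather than from the image itself.
    """
    clean_answer = vlm_answer(image_path, question)
    text_only_answer = vlm_answer(None, question)
    corrupted_answers = {
        name: vlm_answer(corrupt(image_path, name), question) for name in CORRUPTIONS
    }
    return {
        "clean_vs_text_only": score(clean_answer, text_only_answer),
        "clean_vs_corrupted": {
            name: score(clean_answer, ans) for name, ans in corrupted_answers.items()
        },
    }
```

In this framing, a metric that rewards low clean-vs-text-only similarity (answers change when the image is removed) and graceful, corruption-aware degradation would prioritize visual grounding over plausible-sounding but ungrounded responses, in the spirit of the refined metrics proposed in the paper.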