Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Abstract

Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs' awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.
